Abstract

An (n, k)-Poisson Multinomial Distribution (PMD) is the distribution of the sum of n independent
random vectors supported on the set B_k = {e_1, ..., e_k} of standard basis vectors in R^k.
We prove a structural characterization of these distributions, showing that, for all ε > 0, any
(n, k)-Poisson multinomial random vector is ε-close, in total variation distance, to the sum of a
discretized multidimensional Gaussian and an independent (poly(k/ε), k)-Poisson multinomial
random vector. Our structural characterization extends the multi-dimensional CLT of [VV11]
by simultaneously applying to all approximation requirements ε. In particular, it overcomes
factors depending on log n and, importantly, the minimum eigenvalue of the PMD’s covariance
matrix.

We use our structural characterization to obtain an ε-cover, in total variation distance, of
the set of all (n, k)-PMDs, significantly improving the cover size of [DP08, DP15], and obtaining
the same qualitative dependence of the cover size on n and ε as the k = 2 cover of [DP09, DP14].
We further exploit this structure to show that (n, k)-PMDs can be learned to within ε in total
variation distance from Õ_k(1/ε^2) samples, which is near-optimal in terms of dependence on ε and
independent of n. In particular, our result generalizes the single-dimensional result of [DDS12]
for Poisson binomials to arbitrary dimension. Finally, as a corollary of our results on PMDs,
we give an Õ_k(1/ε^2)-sample algorithm for learning (n, k)-sums of independent integer random
variables (SIIRVs), which is near-optimal for constant k.
1 Introduction


Poisson Multinomial Distributions (PMDs) are one of the most basic nonparametric multidimensional
families of distributions. They express the distribution of how many out of n thrown balls will fall
into k bins, when the balls (perhaps because of weight or other characteristics) have different biases
towards falling into the different bins. Mathematically, an (n, k)-PMD is the distribution of the sum
of n independent random vectors X_1, ..., X_n supported on the set B_k = {e_1, ..., e_k} of standard
basis vectors in R^k. In particular, an (n, k)-PMD requires for its description n · (k - 1) probabilities,
specifying the distribution of each summand random vector.

In this paper, we advance our understanding of the structure and learnability of this fundamental 
family of distributions by studying the following questions: 

1. Can we approximate PMDs via simpler distributions such as multi-dimensional Gaussians or 
Poissons? Do they always “behave as” discretized multi-dimensional Gaussians or Poissons? 
If not, what is the range of possible “behaviors” that PMDs may exhibit? 

2. Given n, k and ε, is there a small set of distributions that ε-covers, in total variation distance,
the set of all (n, k)-PMDs? And how does the size of the cover scale with n, k and ε?

3. How many samples from an (n, k)-PMD do we need to learn its density to within ε in total
variation distance? What is the dependence of the learning complexity on the size of
their support?

Structure of PMDs It is hard to do justice to the probability literature studying Question 1.
The multi-dimensional CLT informs us that the limiting behavior of (n, k)-PMDs, as n → +∞,
is Gaussian, under conditions on the eigenvalues of the summands’ covariance matrices; see, e.g.,
[VdV00].^1 The CLT is quantified for finite n by the multi-dimensional Berry-Esseen theorem, which
bounds the difference between the probability masses assigned to convex (or a bit more general)
subsets of R^k by an (n, k)-PMD and the multi-dimensional Gaussian distribution with the same
mean vector and covariance matrix, with the bound’s quality typically degrading as the PMD’s
covariance matrix tends to singularity; see, e.g., [Ben05]. More recently, Valiant and Valiant [VV11]
provide a bound in total variation distance between an (n, k)-PMD and the corresponding discretized
multi-dimensional Gaussian, whose quality degrades mildly as n grows and more severely as the minimum
eigenvalue of the PMD’s covariance matrix shrinks (see Theorem 6).^2 Finally, older results using Stein’s
method bound the total variation distance between an (n, k)-PMD and a multivariate Poisson
[Bar88, DP88], or a (bona fide) multinomial distribution [Loh92].

In summary, known bounds show that an (n, k)-PMD can be approximated by simpler, poly(k)-parameter
distributions, but the quality of their approximation depends on the first few moments
of the PMD or its summands. Our goal instead is to provide universal approximation theorems
showing how to approximate a given (n, k)-PMD by simpler distributions for any desired
approximation ε and without assumptions about the moments of the PMD or its summands. Our main
structural theorem is the following.


^1 When we approximate some (n, k)-PMD or refer to the eigenvalues of its covariance matrix, we typically project
the PMD onto a (k - 1)-dimensional space, e.g. by excluding one of its coordinates, as otherwise the covariance
matrix always has a 0 eigenvalue and the distribution does not have full-dimensional support.

^2 Notice that bounds on total variation distance are stronger than bounds on the probabilities of all events defined
by convex sets in R^k that Berry-Esseen-type theorems establish.
Theorem 1 (PMD Structure). For all n, k ∈ N and all ε > 0, an (n, k)-Poisson multinomial
random vector is ε-close, in total variation distance, to the sum of a discretized multidimensional
Gaussian and an independent (poly(k/ε), k)-Poisson multinomial random vector.

By introducing the independent (poly(k/ε), k)-PMD, our structural result side-steps the degradation
of the CLT bound of [VV11] with log n and the smallest eigenvalue of the PMD’s covariance
matrix, correcting it to any desired approximation ε. Interestingly, there may be directions where
the variance of the discretized Gaussian used in our result may be arbitrarily far from that of
the approximated PMD. The sparse PMD added to the Gaussian serves to correct the variance
in those directions, but does so in a correlated manner across several directions. Moreover, while
[VV11] discretize their approximating multidimensional Gaussian to the closest lattice point, our
discretization is more faithful to the structure of its covariance matrix; see Definition 6. We
provide more intuition about our structural result in Section 1.1, where we also outline its proof. A
more detailed proof of Theorem 1 appears in Section 3 and a more detailed statement is given as
Theorem 5.

Covers for PMDs Building covers for (n, k)-PMDs was pursued in [DP08, DP15] as a means
to develop approximation algorithms for Nash equilibria in anonymous games. These are games
where n players share the same action set, say {1, ..., k}, and each player’s utility depends on their
own choice of action as well as the distribution of how many of the other players choose each of
the available actions, but players’ utility functions may otherwise be different. It was shown that
proper ε-covers, in total variation distance, of (n, k)-PMDs^3 imply approximation algorithms for
Nash equilibria in these games, whose complexity scales with the size of the cover. Intuitively,
this is because switching from a mixed Nash equilibrium to a mixed strategy profile with the same
distribution of how many players choose each action does not affect players’ payoffs by more than ε.

The covers for (n, k)-PMDs obtained in the anonymous games papers cited above have size

$$ n^{O\left(2^{k^2} \left(f(k)/\epsilon\right)^{6k}\right)}, \quad \text{where } f(k) \le 2^{3k-1} k^{k^2+1} k!. $$

Such covers are of theoretical interest, their interesting feature being that the size is polynomial
in n. Indeed, the standard discretization of the parameters of a PMD’s constituent vectors results
in covers of size exponential in n, so a more delicate “global” discretization is needed to obtain
covers whose size is polynomial in n.

Besides providing an asymptotically smaller search space for Nash equilibria in anonymous
games, or any other optimization problem over PMDs, the polynomial rather than exponential
dependence of the cover size on n has direct consequences for the learnability of these distributions;
see Theorem 7 (from [DK14], and [AJOS14] for a similar result), which improves a long line of similar
results in the probability literature [DL01]. In particular, a cover of polynomial size implies directly
that these distributions can be learned from a number of samples logarithmic in n, despite their
support being polynomial in n. Motivated by such applications of covers to algorithms and learning,
we use our structural result to obtain an improved cover theorem.

Theorem 2 (PMD Covers). For all n, k ∈ N and ε > 0, there exists an ε-cover, in total variation
distance, of the set of all (n, k)-PMDs whose size is

$$ n^{k^2} \cdot \min\left\{ 2^{\mathrm{poly}(k/\epsilon)},\ 2^{O\left(k^{5k} \log^{k+2}(1/\epsilon)\right)} \right\}. $$

^3 An ε-cover of a set of distributions P is called proper iff the cover is a subset of P.
We make a few remarks about our cover. First, the cover is non-proper, containing distributions
that are of the form specified in Theorem 1, i.e. are convolutions of a discretized Gaussian and
a PMD. Moreover, it is straightforward to see that any cover has size at least n^{Ω(k)} and at least
(1/ε)^{Ω(k)}. For the first lower bound, count the number of (n, k)-PMDs whose summands are
deterministic. For the second, count the number of (1, k)-PMDs whose probabilities are integer multiples
of ε. So, for fixed k, our bound has the right qualitative dependence on n (namely polynomial),
and a near-right dependence on 1/ε (namely quasi-polynomial rather than polynomial). Moreover,
it obtains the same qualitative dependence on n and ε as the k = 2 cover of [DP09, DP14], namely
polynomial in n and quasi-polynomial in 1/ε.
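
The two counting arguments can be checked on small instances. The following is a minimal sketch in Python; the function names are illustrative and not taken from the paper, and 1/ε is assumed to be an integer m. An (n, k)-PMD with deterministic summands is a point mass on a histogram of n balls in k bins, and distinct point masses are at total variation distance 1, so counting histograms counts distributions that no single cover element can serve simultaneously.

```python
from itertools import product
from math import comb

def count_histograms(n, k):
    """Number of ways to place n balls into k bins (stars and bars)."""
    return comb(n + k - 1, k - 1)

def enumerate_histograms(n, k):
    """Brute-force list of all k-tuples of nonnegative integers summing to n."""
    return [h for h in product(range(n + 1), repeat=k) if sum(h) == n]

n, k, m = 5, 3, 4  # m = 1/eps, assumed to be an integer

# (n, k)-PMDs with deterministic summands: one point mass per histogram.
deterministic_pmds = enumerate_histograms(n, k)

# (1, k)-PMDs whose probabilities are integer multiples of eps = 1/m.
grid_1k_pmds = [tuple(c / m for c in h) for h in enumerate_histograms(m, k)]

print(len(deterministic_pmds), count_histograms(n, k))
print(len(grid_1k_pmds), count_histograms(m, k))
```

For fixed k, the first count grows polynomially in n and the second polynomially in 1/ε, both with degree k - 1.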

Learning PMDs In view of tools for hypothesis selection from a cover (see, e.g., Theorem 7), our
cover theorem directly implies that (n, k)-PMDs can be learned from O(k^{5k} · log n · log^{k+2}(1/ε)/ε^2)
samples. This is near-optimal in terms of ε, as Ω(k/ε^2) samples are necessary even for learning
a (1, k)-PMD. We show that the dependence on n can be completely removed from the learner,
generalizing the results on Poisson Binomial Distributions [DDS12].

Theorem 3 (PMD Learning). For all n, k ∈ N and ε > 0, there is a learning algorithm for (n, k)-PMDs
with the following properties: Let X = Σ_{i=1}^n X_i be any (n, k)-Poisson multinomial random
vector. The algorithm uses

$$ \min\left\{ O\left(k^{5k} \log^{k+2}(1/\epsilon)/\epsilon^2\right),\ \mathrm{poly}(k/\epsilon) \right\} $$

samples from X, runs in 

min 1| , 

and with probability at least 9/10 outputs a (succinct description of a) random vector X̂ such that
d_TV(X, X̂) ≤ ε.


Additional Results: Learning k-SIIRVs An (n, k)-SIIRV is the sum of n independent (single-dimensional)
random variables supported on {0, ..., k - 1}. SIIRVs generalize Poisson Binomial
distributions, which correspond to the case k = 2. At the same time, SIIRVs can be viewed as
projections of PMDs onto the vector (0, 1, ..., k - 1). In particular, if X is an (n, k)-SIIRV, there
exists an (n, k)-Poisson multinomial random vector Y such that X = (0, 1, ..., k - 1)^T · Y.
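
The correspondence can be illustrated on a single realization. In the sketch below (Python, with hypothetical helper names), each SIIRV summand taking value x in {0, ..., k - 1} is mapped to the basis vector that increments bin x of a histogram; the inner product of the histogram with (0, 1, ..., k - 1) then recovers the sum.

```python
def histogram_of(outcomes, k):
    """Histogram Y of the summand outcomes: Y[x] counts the summands equal to x."""
    hist = [0] * k
    for x in outcomes:
        hist[x] += 1
    return hist

def project(hist):
    """Inner product of the histogram with the vector (0, 1, ..., k-1)."""
    return sum(value * count for value, count in enumerate(hist))

# One realization of the n = 6 summands of a (6, 4)-SIIRV.
outcomes = [0, 2, 2, 1, 0, 3]
hist = histogram_of(outcomes, k=4)
print(hist, project(hist), sum(outcomes))
```

Since the identity holds for every realization, it holds for the random variables themselves.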

Recent work has established that (n, k)-SIIRVs can be learned from poly(k/ε) samples, independent
of n, while even learning a (1, k)-SIIRV already requires Ω(k/ε^2) samples [DDO+13]. A
question arising from this work is finding the optimal dependence of the sample complexity on ε.
Demonstrating the expressive power of PMDs, as a corollary of our cover result, we show that the
optimal dependence is actually Õ_k(1/ε^2).

Theorem 4 (SIIRV Learning). For all n, k ∈ N and ε > 0, there is a learning algorithm for
(n, k)-SIIRVs with the following properties: Let X = Σ_{i=1}^n X_i be any (n, k)-SIIRV. The algorithm
uses k^{5k} · O(log^{k+2}(1/ε)/ε^2) samples from X, runs in time […], and with probability
at least 9/10 outputs a random vector X̂ such that d_TV(X, X̂) ≤ ε.

Simultaneous work by Diakonikolas, Kane and Stewart [DKS15] takes a direct approach to
solving this problem. Using Fourier-based methods, they give a polynomial-time algorithm which
requires Õ(k/ε^2) samples, obtaining near-optimal dependence on both k and ε.

^4 We work in the standard “word RAM” model in which basic arithmetic operations on O(log n)-bit integers are
assumed to take constant time.
1.1 Approach

Structure The multi-dimensional nature of PMDs poses challenges in understanding their structure.
The projection of an (n, k)-Poisson multinomial random vector onto each standard basis vector
is an n-Poisson Binomial random variable, i.e. distributed as the sum of n independent indicators.
Depending on our choice of ε, the latter may be ε-close (in total variation distance) to a
discretized Normal distribution (“heavy projection”) or a distribution whose essential support is a
length O(1/ε^3) subinterval of {0, ..., n} (“light projection”) [DP14]. Intuitively, one would like to
aggregate all heavy projections into a discretized multi-dimensional Gaussian and all light projections
into a distribution of small support, independent of n. However, projections onto different
standard basis vectors may be correlated, and they cannot be disentangled this simply.
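
The projection onto a single coordinate can be computed exactly. As a minimal sketch (Python; the function names and the example matrix are illustrative assumptions), column j of the parameter matrix lists the success probabilities of n independent indicators, and a dynamic program over the summands yields the Poisson Binomial probability mass function. Its variance, the sum of p(1 - p) over the column, is the quantity that separates heavy from light projections; the exact threshold depends on ε and is not fixed here.

```python
def poisson_binomial_pmf(probs):
    """pmf[s] = Pr[sum of independent Bernoulli(probs[i]) equals s]."""
    pmf = [1.0]
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for s, mass in enumerate(pmf):
            nxt[s] += mass * (1.0 - p)   # this indicator is 0
            nxt[s + 1] += mass * p       # this indicator is 1
        pmf = nxt
    return pmf

def coordinate_projection(pi, j):
    """Law of coordinate j of the PMD with parameter matrix pi."""
    return poisson_binomial_pmf([row[j] for row in pi])

def variance(pmf):
    mean = sum(s * mass for s, mass in enumerate(pmf))
    return sum((s - mean) ** 2 * mass for s, mass in enumerate(pmf))

# Example (4, 3)-PMD; each row is the distribution of one summand.
pi = [[0.5, 0.3, 0.2], [0.1, 0.1, 0.8], [0.25, 0.25, 0.5], [0.9, 0.05, 0.05]]
proj = coordinate_projection(pi, 0)
print(proj, variance(proj))
```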

In fact, even if all projections of a PMD onto the standard basis vectors are heavy, even if
they have variance super-polynomial in k/ε, it is still unclear whether the PMD can always be well
approximated by a discretized multi-dimensional Gaussian. In particular, the multi-dimensional
CLT of Valiant and Valiant [VV11] (Theorem 6) does pay a penalty that scales with log n.

Finally, projections onto non-standard basis vectors may behave more erratically. As we pointed
out earlier, the projection of an (n, k)-PMD onto the vector v = (0, 1, ..., k - 1) is an (n, k)-SIIRV,
which need not be log-concave or even unimodal, and could even exhibit “mod-structure” and be
n-modal; think of the distribution of V + 2 · Z where Z is sampled from a Binomial(n, 0.5) and V is a
Bernoulli(1/3). Whichever simpler distribution we identify to approximate a given (n, k)-PMD thus
needs to respect the potential mod-structure that the PMD’s projection onto v, its permutations
or other integral vectors may exhibit.
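
The mod-structure of this example can be verified exactly. The sketch below (Python with exact rational arithmetic; the function names are illustrative) builds the probability mass function of V + 2 · Z and counts its strict local maxima. Every even point carries twice the mass of the odd point to its right, so the mass function oscillates and the number of modes grows linearly with n.

```python
from fractions import Fraction
from math import comb

def pmf_v_plus_2z(n):
    """Exact pmf of V + 2Z on {0, ..., 2n+1}, Z ~ Binomial(n, 1/2), V ~ Bernoulli(1/3)."""
    pmf = [Fraction(0)] * (2 * n + 2)
    for z in range(n + 1):
        binom = Fraction(comb(n, z), 2 ** n)
        pmf[2 * z] = Fraction(2, 3) * binom      # V = 0
        pmf[2 * z + 1] = Fraction(1, 3) * binom  # V = 1
    return pmf

def count_strict_local_maxima(pmf):
    """Number of points whose mass strictly exceeds both neighbours (0 outside the support)."""
    padded = [Fraction(0)] + list(pmf) + [Fraction(0)]
    return sum(
        1
        for i in range(1, len(padded) - 1)
        if padded[i] > padded[i - 1] and padded[i] > padded[i + 1]
    )

pmf = pmf_v_plus_2z(30)
modes = count_strict_local_maxima(pmf)
print(modes)
```

A discretized Gaussian with the same mean and variance has a single mode, so it cannot be close to this distribution in total variation distance.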

Our analysis sidesteps the difficulties identified above by showing that, for all ε, n, k, an (n, k)-Poisson
multinomial random vector is ε-close to the sum of a discretized Gaussian and an independent
(poly(k/ε), k)-Poisson multinomial random vector. Roughly speaking, the Gaussian
absorbs the variance in the heavy dimensions, and explains the correlation between light and heavy
dimensions, while the sparse PMD explains the remaining variance in the light dimensions. Of
course, which dimensions are “light” and “heavy” in the above discussion depends on our desired
approximation ε.

At the heart of our proof lies the aforecited CLT by Valiant and Valiant [VV11], approximating
a Poisson Multinomial by a discretized Gaussian. There are several issues with its application
here: the accuracy of the approximation cannot be made an arbitrary ε, but worse, it deteriorates
(logarithmically) as we increase n or decrease the minimum eigenvalue of the covariance matrix of
the PMD. The main intuition behind our structural theorem and the main technical roadblock for
its proof lies in avoiding paying these two penalties.
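
The two penalties can be made concrete by evaluating the right-hand side of the CLT bound numerically. The sketch below assumes the bound of Theorem 6 has the form k^{4/3}/σ^{1/3} · 2.2 · (3.1 + 0.83 log n)^{2/3}, with σ^2 the minimum eigenvalue of the covariance matrix, and further assumes that the logarithm is natural. The function names are illustrative.

```python
import math

def vv_clt_bound(n, k, sigma_sq):
    """Assumed right-hand side of the CLT bound; sigma_sq is the minimum eigenvalue."""
    sigma = math.sqrt(sigma_sq)
    log_term = (3.1 + 0.83 * math.log(n)) ** (2.0 / 3.0)  # natural log: an assumption
    return (k ** (4.0 / 3.0)) / (sigma ** (1.0 / 3.0)) * 2.2 * log_term

def min_sigma_sq_for(eps, n, k):
    """Smallest minimum eigenvalue for which the assumed bound is at most eps."""
    numerator = (k ** (4.0 / 3.0)) * 2.2 * (3.1 + 0.83 * math.log(n)) ** (2.0 / 3.0)
    sigma = (numerator / eps) ** 3
    return sigma ** 2

for n in (10 ** 3, 10 ** 6):
    print(n, vv_clt_bound(n, k=3, sigma_sq=1e6), min_sigma_sq_for(0.1, n, k=3))
```

Under these assumptions, the minimum eigenvalue needed for a fixed target accuracy grows with n, which is the dependence the bucketing argument is designed to avoid.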

To mitigate the latter cost (corresponding to the smallest eigenvalue), we use a stripped-down
version of the trickle-down sampling procedure from [DP08] to round the parameters of our given
PMD. This allows us to shift the parameters of the PMD’s constituent random vectors such that
they are either equal to 0 or 1, or sufficiently far from 0 and 1. A coordinated “rounding” of these
parameters, combined with a coupling argument and single-dimensional Poisson approximations,
allows us to argue that the effect of the rounding is small in the total variation distance of the
resulting PMD compared to the original PMD. Each constituent random vector in the resulting
PMD now has decent variance in every axis direction where it has non-zero variance. Partitioning
the PMD’s constituent vectors into sets based on the axis directions where they have non-zero
variance, we get that the minimum eigenvalue of each resulting sub-PMD is large in the span of
these directions; see Proposition […].^5 Details about this step are given in Section 3.1.

^5 Again, as pointed out earlier, when we refer to the eigenvalues of the covariance matrix of a PMD spanning a
certain subspace, we always project the PMD onto a subspace of one dimension less, as otherwise the covariance
To avoid paying the logarithmic cost in the value of n (the number of summands) which appears
in the CLT, we repeatedly partition and sort the random vectors into buckets. The sub-PMD
corresponding to each bucket will have the property that the logarithm of the number of summands
is negligible compared to the minimum eigenvalue of its covariance matrix, so that we can apply
the central limit theorem from [VV11]. We note that there will be a small number of random
vectors which do not fall into a bucket that has this property; these leftover vectors result in the
sparse Poisson Multinomial component in our structural result. Details about this step are given
in Section 3.2.

The above approximations result in a distribution comprising several discretized Gaussians and
a sparse Poisson multinomial. We subsequently merge all component discretized Gaussians into
a single distribution. It is well-known that the sum of two independent Gaussians is another Gaussian whose
parameters are equal to the sum of the parameters of its two components. The same is not true
for discretized Gaussians, and we must quantify the error induced by this merging operation. More
details are provided in Section 3.3.
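
The failure of closure under addition can be observed in one dimension. The sketch below (Python; a simplified illustration with hypothetical function names, using plain rounding to the nearest integer rather than the paper's multidimensional construction) compares the sum of two independent discretized Gaussians with the discretization of the Gaussian whose mean and variance are the sums.

```python
import math

def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def discretized_gaussian_pmf(mu, sigma, radius):
    """pmf of round(N(mu, sigma^2)), truncated to integers within `radius` of mu."""
    centre = round(mu)
    return {
        x: normal_cdf(x + 0.5, mu, sigma) - normal_cdf(x - 0.5, mu, sigma)
        for x in range(centre - radius, centre + radius + 1)
    }

def convolve(p, q):
    """pmf of the sum of two independent integer random variables."""
    out = {}
    for x, px in p.items():
        for y, qy in q.items():
            out[x + y] = out.get(x + y, 0.0) + px * qy
    return out

def total_variation(p, q):
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in set(p) | set(q))

def merge_error(sigma, radius=80):
    """TV distance between round(G1) + round(G2) and round(G1 + G2), G1, G2 ~ N(0, sigma^2)."""
    single = discretized_gaussian_pmf(0.0, sigma, radius)
    sum_of_rounded = convolve(single, single)
    rounded_sum = discretized_gaussian_pmf(0.0, sigma * math.sqrt(2.0), 2 * radius)
    return total_variation(sum_of_rounded, rounded_sum)

for sigma in (0.5, 5.0):
    print(sigma, merge_error(sigma))
```

In this one-dimensional experiment the merging error is noticeable when the variance is small and shrinks as the variance grows, which is consistent with the need to control the minimum eigenvalue before merging.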

Our structural results are described further in Section 3.

Cover We provide two covers for (n, k)-PMDs, which are advantageous for different regimes of k
and ε. The first cover follows directly from Theorem 5, which gives a structural characterization of
a PMD as the sum of an appropriately discretized Gaussian and a (poly(k/ε), k)-PMD. We simply
take an additive grid over all the parameters of this characterization to achieve a cover size which
is polynomial in n and exponential in k and 1/ε.

Similar to [DP14], we can reduce the dependence of the cover size to quasi-polynomial in
1/ε, albeit at an increased cost in k. This is done using a generalization of the moment matching
techniques known for Poisson Binomial distributions. At a high level, this avoids the naive gridding
over all (poly(k/ε), k)-PMDs by filtering out the ones with unique “moment profiles,” which
describe the first several moments of the distribution. We prove that any two distributions with
matching moment profiles will have small total variation distance by leveraging results by Roos on
Krawtchouk approximations to PMDs [Roo02].

A further description of our cover results is provided in Section 4.

Learning Our cover theorem (Theorem 2) directly implies (using Theorem 7) that (n, k)-PMDs
can be learned from O(log N/ε^2) samples, where N is the size of our cover. Given that N is
polynomial in n, the resulting sample complexity is logarithmic in n. To remove the dependence
on n from our sample complexity, we need to exploit not just the size but also the structure of the
cover.

In particular, we know from our structural characterization (Theorem 1) that any (n, k)-Poisson
Multinomial random vector is ε-close to the sum of a discretized multi-dimensional Gaussian and
an independent (poly(k/ε), k)-PMD. The dependence of the cover size on n is due to enumerating
over a cover of discretized multi-dimensional Gaussians, as enumerating over (poly(k/ε), k)-PMDs
has no dependence on n. The challenge is this: given sample access to an unknown (n, k)-PMD,
can we zoom in to a smaller set of candidate discretized multi-dimensional Gaussians whose size
is independent of n and which suffice for the purposes of guaranteeing an approximation to the
unknown PMD?

^5 (continued) matrix always has a 0 eigenvalue since the distribution does not have full-dimensional support.

Let us start with an easier task. Suppose that our structural theorem decides that an (n, k)-PMD
is ε-close in total variation distance to a discretized multi-dimensional Gaussian. In this
case, is it possible to recover the Gaussian from poly(k/ε) samples from the PMD? Intuitively the
answer should be “yes,” as learning a multi-dimensional Gaussian to within ε in total variation
distance is feasible from O(k/ε^2) samples. There are, however, two complications. First, we are seeking
to actually learn a discretized multi-dimensional Gaussian and, most importantly, we do not have
sample access to the Gaussian, but to a distribution that is ε-close to it in total variation distance.
The first complication becomes an issue when the covariance matrix of the Gaussian has minimum
eigenvalue that does not scale with some poly(k/ε), which may very well be the case. The second is
more severe as it necessitates robust estimators for the moments of a (discretized) multi-dimensional
Gaussian that are resilient to an arbitrary movement of ε probability mass. We are not aware of
such estimators even for a (continuous) multi-dimensional Gaussian.

Despite these apparent issues, even in the simple case we are considering, the saving grace
comes from a closer examination of the proof of our structural result. When our structural theorem
deems an (n, k)-PMD approximable by a discretized multi-dimensional Gaussian, we can argue
that the covariance matrices Σ of the former and Σ_G of the latter are spectrally close, satisfying
|x^T Σ x - x^T Σ_G x| ≤ ε · x^T Σ x, for all x. So it suffices to learn the covariance matrix of the PMD to
which we have direct sample access, thereby obviating the need for a robust estimator. Learning
the covariance matrix of a PMD is feasible from poly(k/ε) samples by bounding the kurtosis of any
projection of the PMD (Lemma 8).
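
As a rough empirical illustration of this step, the sketch below (Python; the sample size, the seed, the example matrix and the function names are all assumptions made for the illustration, and this is not the estimator analysed in Lemma 8) draws samples from a small PMD, forms the empirical covariance matrix, and compares the quadratic forms x^T Σ x and x^T Σ̂ x in a few directions of non-zero variance.

```python
import random

def true_covariance(pi):
    """Covariance of the PMD: sum over rows of diag(row) - row row^T."""
    k = len(pi[0])
    cov = [[0.0] * k for _ in range(k)]
    for row in pi:
        for a in range(k):
            for b in range(k):
                cov[a][b] += (row[a] if a == b else 0.0) - row[a] * row[b]
    return cov

def sample_histogram(pi, rng):
    """One draw from the PMD: each row picks a column, and the picks are tallied."""
    hist = [0] * len(pi[0])
    for row in pi:
        hist[rng.choices(range(len(row)), weights=row)[0]] += 1
    return hist

def empirical_covariance(samples):
    m, k = len(samples), len(samples[0])
    mean = [sum(s[a] for s in samples) / m for a in range(k)]
    cov = [[0.0] * k for _ in range(k)]
    for s in samples:
        d = [s[a] - mean[a] for a in range(k)]
        for a in range(k):
            for b in range(k):
                cov[a][b] += d[a] * d[b] / (m - 1)
    return cov

def quadratic_form(cov, x):
    k = len(x)
    return sum(x[a] * cov[a][b] * x[b] for a in range(k) for b in range(k))

rng = random.Random(0)  # fixed seed so the run is repeatable
pi = [[0.5, 0.3, 0.2], [0.2, 0.2, 0.6], [0.1, 0.8, 0.1]] * 10  # n = 30, k = 3
samples = [sample_histogram(pi, rng) for _ in range(4000)]
sigma, sigma_hat = true_covariance(pi), empirical_covariance(samples)

directions = [(1, 0, 0), (0, 1, 0), (1, -1, 0)]
rel_errors = [
    abs(quadratic_form(sigma_hat, x) - quadratic_form(sigma, x)) / quadratic_form(sigma, x)
    for x in directions
]
print(rel_errors)
```

The all-ones direction is excluded because both quadratic forms vanish there: every sample sums to n.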

The bigger challenge is generalizing the approach to when our structural theorem deems an (n, k)-Poisson
Multinomial random vector X approximable by the sum of a discretized multi-dimensional
Gaussian G and a (poly(k/ε), k)-Poisson Multinomial random vector Y. We can enumerate over
the latter, but enumerating over the former is too expensive (i.e. will incur a dependence on n).
So we have to learn it with sample access to X. Unfortunately, our spectral approximation is now
much weaker. The covariance matrices Σ of X and Σ_G of G are now related as follows, for all x:
|x^T Σ x - x^T Σ_G x| ≤ ε · x^T Σ x + poly(k/ε). Hence, for directions x where the variance of X is
small, this approximation is too loose to simply approximate Σ_G with Σ.

Our approach is instead to use samples from X to get a handle on the spectrum of Σ_G. As
before, by bounding the kurtosis of any projection of the PMD, we can produce an estimate Σ̂
that approximates Σ spectrally: for all x, |x^T Σ̂ x - x^T Σ x| ≤ ε · x^T Σ x (Lemma 8). Then, using
the Courant minimax principle through the proof of our structural result, we can argue that the i-th
eigenvalue λ_i^G of Σ_G and λ_i of Σ are related as follows: |λ_i^G - λ_i| ≤ O(ε) λ_i + poly(k/ε). So,
using the eigenvalues of our learned Σ̂, we can produce a small cover for the eigenvalues of Σ_G.
Unfortunately, the corresponding eigenvectors of Σ_G and Σ need not be as closely related, and it
is not clear how to grid over those as the ratio of the smallest to the largest eigenvalue may be
polynomial in n. We show how to use the knowledge of the eigenvalues and the spectral relation
between Σ and Σ_G to produce a small cover over matrices Σ_G (and not eigenvectors) such that at
least one matrix in the cover spectrally approximates our target Σ_G. The details are provided in
Section 5.3. At this point, we have a small cover over possible distributions Y and a small cover
over possible discretized multi-dimensional Gaussians. So we can select among these hypotheses
using Theorem 7.

Our learning algorithm is described in Section 5.
2 Preliminaries


2.1 Parameters 

Throughout this paper, we will repeatedly refer to three key parameters, c = c(ε, k) = poly(ε/k),
t = t(ε, k) = poly(k/ε), and γ = O(1). We set

/g2xl+5. 

for constants δ_c, δ_t, δ_γ > 0.

2.2 Definitions 

We start by defining several of the distribution classes we will consider. First, and most importantly, 
we start with a formal definition of Poisson Multinomial Distributions. 

Definition 1. A k-Categorical Random Variable (k-CRV) is a random variable that takes values
in {e_1, ..., e_k}, where e_j is the k-dimensional unit vector along direction j. π(i) is the probability
of observing e_i.

Definition 2. An (n, k)-Poisson Multinomial Distribution ((n, k)-PMD) is given by the law of
the sum of n independent but not necessarily identical k-CRVs. An (n, k)-PMD is parameterized
by a nonnegative matrix π ∈ [0, 1]^{n×k}, each of whose rows sums to 1, is denoted by M^π, and is
defined by the following random process: for each row π(i, ·) of the matrix π, interpret it as a probability
distribution over the columns of π and draw a column index from this distribution. Finally, return
a row vector recording the total number of samples falling into each column (the histogram of the
samples).
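
For small n and k, the law M^π can be computed exactly by following the random process one row at a time. The sketch below (Python; the function name and the example matrix are illustrative) keeps a dictionary from histograms to probabilities and, for each row, spreads every histogram's mass over the k possible column draws.

```python
def pmd_pmf(pi):
    """Exact law of the PMD with parameter matrix pi, as {histogram: probability}."""
    k = len(pi[0])
    law = {(0,) * k: 1.0}
    for row in pi:
        nxt = {}
        for hist, mass in law.items():
            for j, p in enumerate(row):
                if p == 0.0:
                    continue  # this column is never drawn by this row
                bumped = hist[:j] + (hist[j] + 1,) + hist[j + 1:]
                nxt[bumped] = nxt.get(bumped, 0.0) + mass * p
        law = nxt
    return law

# Example (3, 3)-PMD; each row is the distribution of one summand.
pi = [[0.5, 0.3, 0.2], [0.1, 0.1, 0.8], [1.0, 0.0, 0.0]]
law = pmd_pmf(pi)
for hist, mass in sorted(law.items()):
    print(hist, round(mass, 4))
```

The support has at most as many points as there are histograms of n balls in k bins, so this direct computation is only practical for small instances.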

We note that a sample from an (n, k)-PMD is redundant: given k - 1 coordinates of a sample, we
can recover the final coordinate by noting that the sum of all k coordinates is n. For instance, while
a Binomial distribution is over a support of size 2, a sample is 1-dimensional since the frequency
of the other coordinate may be inferred given the parameter n. With this inspiration in mind, we
define the Generalized Multinomial Distribution, which is the primary object of study in [VV11].
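
The same redundancy shows up in the second moments. Writing π_i for the i-th row of π as a column vector and 1 for the all-ones vector (notation introduced here only for this illustration), the mean and covariance of an (n, k)-PMD are sums over the independent summands, and the all-ones vector always lies in the kernel of the covariance matrix:

```latex
\[
\mu = \sum_{i=1}^{n} \pi_i,
\qquad
\Sigma = \sum_{i=1}^{n} \left( \operatorname{diag}(\pi_i) - \pi_i \pi_i^{\top} \right).
\]
% Each row sums to one, i.e. \pi_i^{\top} \mathbf{1} = 1, hence
\[
\Sigma \mathbf{1}
  = \sum_{i=1}^{n} \left( \pi_i - \pi_i \left( \pi_i^{\top} \mathbf{1} \right) \right)
  = \sum_{i=1}^{n} \left( \pi_i - \pi_i \right)
  = 0 .
\]
```

This is why statements about the minimum eigenvalue are made only after one coordinate has been dropped.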

Definition 3. A Truncated k-Categorical Random Variable is a random variable that takes values
in {0, e_1, ..., e_{k-1}}, where e_j is the (k - 1)-dimensional unit vector along direction j, and 0 is the
(k - 1)-dimensional zero vector. ρ(0) is the probability of observing the zero vector, and ρ(i) is the
probability of observing e_i.

Definition 4. An (n, k)-Generalized Multinomial Distribution ((n, k)-GMD) is given by the law of
the sum of n independent but not necessarily identical truncated k-CRVs. A GMD is parameterized
by a nonnegative matrix ρ ∈ [0, 1]^{n×(k-1)}, each of whose rows sums to at most 1, is denoted by G^ρ,
and is defined by the following random process: for each row ρ(i, ·) of the matrix ρ, interpret it as a
probability distribution over the columns of ρ, including, if Σ_{j=1}^{k-1} ρ(i, j) < 1, an “invisible” column
0, and draw a column index from this distribution. Finally, return a row vector recording the total
number of samples falling into each column (the histogram of the samples).

For both (n, k)-PMDs and (n, k)-GMDs, we will refer to n and k as the size and dimension,
respectively.
We note that a PMD corresponds to a GMD where the “invisible” column is the zero vector,
and thus the definition of GMDs is more general than that of PMDs. However, whenever we refer 
to a GMD in this paper, it will explicitly have a non-zero invisible column. 

While we will approximate the Multinomial distribution with Gaussian distributions, it does 
not make sense to compare discrete distributions with continuous distributions, since the total 
variation distance is always 1. As such, we must discretize the Gaussian distributions. We will use 
the notation ⌊x⌉ to say that x is rounded to the nearest integer (with ties being broken arbitrarily).
If x is a vector, we round each coordinate independently to the nearest integer.

Definition 5. The k-dimensional Discretized Gaussian Distribution with mean μ and covariance
matrix Σ, denoted ⌊N(μ, Σ)⌉, is the distribution with support Z^k obtained by picking a sample
according to the k-dimensional Gaussian N(μ, Σ), then rounding each coordinate to the nearest
integer.

As seen in the definition of an (n, k)-GMD, we have one coordinate which is equal to n minus
the sum of the other coordinates. We define a similar notion for a discretized Gaussian. However, 
we go one step further, to take care of when there are several such Gaussians which live in disjoint 
dimensions. By this, we mean that given two Gaussians, the set of directions in which they have 
a non-zero variance are disjoint. Without loss of generality (because we can simply relabel the 
dimensions), we assume all of a Gaussian’s non-zero variance directions are consecutive, i.e., the 
covariance matrix is all zeros, except for a single block on the diagonal. Therefore, when we add 
the covariance matrices, the result is block diagonal. The resulting distribution is described in the 
following definition. 

Definition 6. The structure preserving rounding of a multidimensional Gaussian Distribution
takes as input a multi-dimensional Gaussian N(μ, Σ) with Σ in block-diagonal form. It chooses
one coordinate as a “pivot” in each block, samples from the Gaussian ignoring these pivots and
rounds each value to the nearest integer. Finally, the pivot coordinate of each block is set by taking
the difference between the sum of the means and the sum of the values sampled within the block.
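
The rounding step of this definition can be sketched as follows. The code is a minimal illustration in Python under stated assumptions: the real-valued Gaussian sample is supplied by the caller, the first index of each block is taken as its pivot, and the means within each block are assumed to sum to an integer. The function name and the example values are hypothetical.

```python
def structure_preserving_round(sample, mean, blocks):
    """Round non-pivot coordinates, then set each pivot so the block total is preserved.

    sample: real-valued draw from the Gaussian (pivot entries are ignored)
    mean:   mean vector of the Gaussian
    blocks: list of index lists; the first index of each block is its pivot (an assumption)
    """
    rounded = [0] * len(sample)
    for block in blocks:
        pivot, rest = block[0], block[1:]
        for j in rest:
            rounded[j] = round(sample[j])  # nearest integer; ties handled by Python's round
        block_total = round(sum(mean[j] for j in block))  # assumed to be an integer
        rounded[pivot] = block_total - sum(rounded[j] for j in rest)
    return rounded

sample = [2.4, 3.7, 0.2, 5.5, 1.6]
mean = [2.5, 3.5, 1.0, 4.0, 3.0]
blocks = [[0, 1, 2], [3, 4]]
result = structure_preserving_round(sample, mean, blocks)
print(result)
```

By construction, the coordinates of each block sum to the sum of that block's means, mirroring the fact that the coordinates of a PMD always sum to its number of summands.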

3 Structure of PMDs 

In this section, we show a structural result, stating that any (n, k)-PMD is close to the sum of an
appropriately discretized Gaussian and a (poly(k/ε), k)-PMD:

Theorem 5. For parameters c and t as described in Section 2.1, every (n, k)-Poisson multinomial
random vector is ε-close to the sum of a Gaussian with a structure preserving rounding and a
(tk^2, k)-Poisson multinomial random vector. For each block of the Gaussian, the minimum non-zero
eigenvalue of Σ_i is at least […].

There are three main steps in the proof of this theorem. 

Step 1 First, we replace our (n, k)-PMD with one where all parameters are sufficiently far from 0 and
1, while still being close to the original in total variation distance. To motivate this operation,
we introduce one of our main tools in our approach, the central limit theorem of Valiant and
Valiant [VV11], which approximates an (n, k)-GMD by a discretized multivariate Gaussian.

Theorem 6 (Theorem 4 from [VV10]). Given a generalized multinomial distribution G^ρ,
with k dimensions and n rows, let μ denote its mean and Σ denote its covariance matrix,
then

$$ d_{TV}\left(G^{\rho}, \lfloor \mathcal{N}(\mu, \Sigma) \rceil\right) \le \frac{k^{4/3}}{\sigma^{1/3}} \cdot 2.2 \cdot (3.1 + 0.83 \log n)^{2/3}, $$

where σ^2 is the minimum eigenvalue of Σ.

We note that this has an error term which depends on the minimum eigenvalue of the covariance
matrix of the GMD. If we perform this rounding procedure and ignore any zero
coordinates, then we are given the guarantee that the minimum eigenvalue will be sufficiently
large.

Recall that in Section 2.1 we have set c = poly(ε/k). The following lemma summarizes the result of
the rounding procedure:

Lemma 1. For any c < […], given access to the parameter matrix ρ for an (n, k)-PMD M^ρ,
we can efficiently construct another (n, k)-PMD M^ρ̂ such that, for all i, j, ρ̂(i, j) ∉ (0, c),
and



The procedure starts by fixing two coordinates i and j, and considers all CRVs whose parameter
in coordinate i is close to 0 and whose maximum parameter is in coordinate j. We move
some of the weight in this “heavy” coordinate either to or from the “light” coordinate, while
approximately preserving the overall mean vector of the set of CRVs.

The analysis of this process uses a stripped-down version of the “trickle-down” process in
[DP08]. This gives an approximate way to sample from a PMD, resulting in a distribution
which is very close in total variation distance. While we postpone technical details to
Section B.1, roughly speaking, it works as follows. First, take a sample from the PMD but
disregard the values for its light coordinate $i$ and heavy coordinate $j$. Instead, sample a new
value for coordinate $i$ according to a Poisson distribution with parameter $\mu_i$, the mean value
for coordinate $i$. Finally, set coordinate $j$ to ensure that all coordinates of the sample sum
to $n$. As mentioned before, the rounding process approximately preserves the value of $\mu_i$,
and thus this alternate sampling procedure is closely coupled for the rounded and original
PMD. Thus, by the triangle inequality, the rounded and original PMDs are close in total variation
distance.

We repeat this rounding procedure for each $i$ and $j$, eventually leading to all parameters
either being equal to or far from 0 and 1. A full description and analysis of the rounding
procedure are in Section B.1.

Step 2 Now, we have a “massaged” $(n,k)$-PMD $M^{\hat\rho}$, with no parameters lying in the intervals $(0,c)$
or $(1-c, 1)$. Next, we will show how to relate the massaged $(n,k)$-Poisson multinomial random
vector to a sum of $k$ Gaussians with a structure preserving rounding plus a “sparse”
$(\mathrm{poly}(k/\varepsilon), k)$-PMD. The general roadmap is as follows. We start by partitioning the
constituent $k$-CRVs into $k$ sets $S_1, \dots, S_k$, based on which basis vector we are most likely to
observe. We work separately for each set $S_i$ by considering the GMD formed by leaving out
the coordinate $i$. Our goal is to use the CLT of Theorem 6 to bound the total variation
distance between the corresponding GMD and a discretized Gaussian with the same mean and
covariance matrix. We must be careful when applying Theorem 6, since the bound depends
on the size of the GMD. Instead of applying the theorem directly, to get a useful bound,
we further partition the set $S_i$ into smaller subsets and apply the theorem to each of the
resulting subsets. We can then “merge” the resulting discretized Gaussians together using
the following lemma, whose proof is given in Section A.5.
Lemma 2. Let $X_1 \sim \mathcal{N}(\mu_1, \Sigma_1)$ and $X_2 \sim \mathcal{N}(\mu_2, \Sigma_2)$ be $k$-dimensional Gaussian random
variables, and let $\sigma = \min_j \max_i \sigma_{i,j}$, where $\sigma_{i,j}$ is the standard deviation of $X_i$ in the direction
parallel to the $j$th coordinate axis. Then

$$d_{TV}\left(\lfloor X_1 + X_2 \rceil, \lfloor X_1 \rceil + \lfloor X_2 \rceil\right) \le \frac{k}{2\sigma}.$$


In more detail, we partition each set $S_i$ into subsets, grouping together CRVs according
to the dimensions they are non-zero in, i.e. set $S_i^{\mathcal{I}}$ contains all CRVs that are non-zero in
the coordinates given by set $\mathcal{I} \subseteq [k] \setminus \{i\}$. We then group these sets into buckets, where
a set is assigned to a bucket depending on its cardinality; bucket $B^\ell$ gets all sets $S_i^{\mathcal{I}}$ with
$|S_i^{\mathcal{I}}| \in [\ell^\gamma t, (\ell+1)^\gamma t)$, with $\gamma = O(1)$ and $t = \mathrm{poly}(k/\varepsilon)$ as defined in Section 2.1. This
bounds the ratio between the size and the minimum eigenvalue of the covariance of the GMD
within every bucket other than $B^0$. This allows us to apply Theorem 6 and replace the CRVs
within each bucket $B^\ell$ for $\ell \ge 1$ with a discretized Gaussian, leaving us with a $(\mathrm{poly}(2^k/\varepsilon), k)$-GMD
consisting of all the CRVs of bucket $B^0$. To reduce the number of remaining CRVs
to polynomial in $k$, we show that by removing only $\mathrm{poly}(k/\varepsilon)$ of these CRVs, we can apply
Theorem 6 again to the rest and obtain another discretized Gaussian. In particular, in
Section B.2 we prove the following lemma:

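Lemma 3. Let $G^{\rho^0}$ be the $(|B^0|, k)$-GMD induced by the truncated CRVs in bucket $B^0$. Given
$\rho^0$, we can efficiently compute a partition of $B^0$ into $S$ and $\bar{S}$, where $|S| \le kt$. Letting $\mu_{\bar{S}}$
and $\Sigma_{\bar{S}}$ be the mean and covariance matrix of the $(|\bar{S}|, k)$-GMD induced by $\bar{S}$, and $G^{\rho_{\bar{S}}}$ be the
$(|\bar{S}|, k)$-GMD induced by $\bar{S}$,

$$d_{TV}\left(G^{\rho_{\bar{S}}}, \lfloor \mathcal{N}(\mu_{\bar{S}}, \Sigma_{\bar{S}}) \rceil\right) \le \frac{8.646\, k^{3/2} \log^{2/3}(2^k t)}{t^{1/6} c^{1/6}}.$$

Furthermore, the minimum non-zero eigenvalue of $\Sigma_{\bar{S}}$ is at least $\frac{tc}{k}$.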


After merging together all discretized Gaussians (at most one coming from each bucket $B^\ell$
for all $\ell \ge 0$) by iteratively applying Lemma 2, we are able to approximate each original set
of CRVs $S_i$ as the sum of a single discretized Gaussian and a $(\mathrm{poly}(k/\varepsilon), k)$-PMD. Combining
the result from each of the sets $S_i$ of the initial partition, we obtain the sum of $k$ discretized
Gaussians and a $(\mathrm{poly}(k/\varepsilon), k)$-PMD. The details of this step are described in Section B.2.
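The first level of the partitioning in Step 2 can be sketched in a few lines of Python. This is only an illustration of the bookkeeping: each CRV is a probability vector, it is assigned a pivot (its most likely coordinate, with ties going to the smallest index, which is an assumption of this sketch), and CRVs with the same pivot are grouped by the set of other coordinates in which they are non-zero. The function name `group_crvs` is hypothetical, and the bucketing by cardinality is not shown.

```python
from collections import defaultdict


def group_crvs(rho):
    """Group CRV indices by (pivot coordinate, support outside the pivot).

    rho is a list of probability vectors, one per CRV.
    """
    groups = defaultdict(list)
    for index, row in enumerate(rho):
        # Most likely coordinate; ties are broken towards the smallest index.
        pivot = max(range(len(row)), key=lambda j: (row[j], -j))
        support = frozenset(j for j, p in enumerate(row) if p > 0 and j != pivot)
        groups[(pivot, support)].append(index)
    return dict(groups)


example = [
    [0.7, 0.3, 0.0],
    [0.6, 0.0, 0.4],
    [0.2, 0.8, 0.0],
    [0.5, 0.5, 0.0],
]
grouped = group_crvs(example)
```

Each value of `grouped` corresponds to one set of CRVs that share a pivot and a support, which is the unit that is later assigned to a bucket.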

Step 3 The final step is to show that the $k$ discretized Gaussians can be merged into a single Gaussian
with a structure preserving rounding. We note that we cannot apply Lemma 2 here, since each
discretized Gaussian has a different pivot coordinate that has been left out. (Recall that by
construction, the CRVs in set $S_i$ are approximated by a discretized Gaussian that leaves out
coordinate $i$.) We thus need a new tool to enable us to merge Gaussians defined in different
dimensions. The main idea is that if two Gaussians with a structure preserving rounding
overlap in some dimension, we can use the common dimension as the pivot. We then add the
mean vectors and covariance matrices to merge the distributions. Iteratively repeating this
process will merge all distributions which overlap in some coordinate. This leaves us with
one or many discretized Gaussians that lie in completely disjoint coordinates, which we can
describe as a single Gaussian with a structure preserving rounding (defining blocks according
to the coordinates spanned by each Gaussian). If these were (continuous) Gaussians, the
swapping and merging operations would have no cost, but some care is required when dealing
with discretized Gaussians. There are two costs which we must bound here. First, we must
show that swapping the pivot of a PMD is inexpensive, and second, we need to bound the
cost of repeatedly merging Gaussians.
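The block structure that results from this merging depends only on which coordinate sets overlap. The following Python sketch tracks just that bookkeeping, under the assumption that each Gaussian is represented by the set of coordinates it spans; the pivot swaps and the addition of means and covariances are deliberately left out, and the name `merge_overlapping` is hypothetical.

```python
def merge_overlapping(blocks):
    """Repeatedly merge coordinate sets that share at least one coordinate."""
    merged = []
    for block in blocks:
        current = set(block)
        untouched = []
        for other in merged:
            if other & current:
                # A common coordinate exists, so it can serve as the pivot.
                current |= other
            else:
                untouched.append(other)
        untouched.append(current)
        merged = untouched
    return merged


final_blocks = merge_overlapping([{0, 1}, {2, 3}, {1, 2}, {5}])
```

The sets that remain are pairwise disjoint, matching the blocks of the single Gaussian with a structure preserving rounding.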

We bound the cost of swapping the pivot by proving the following lemma: 

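Lemma 4 (Total Variation Swap Lemma). For $\mu \in \mathbb{R}^k$, positive semidefinite $\Sigma \in \mathbb{R}^{k \times k}$, and
$n \in \mathbb{Z}$, let

- $X_i$ be the distribution $\mathcal{N}(\mu_{-i}, \Sigma_{-i})$, where $\mu_{-i} \in \mathbb{R}^{k-1}$ is $\mu$ with the $i$th coordinate
removed, and $\Sigma_{-i}$ is $\Sigma$ with the $i$th row and column removed;

- $Y_i$ be the distribution in which we draw a sample $(x_1, \dots, x_{k-1}) \sim X_i$ and return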

k-l 

([xi],..., (n - 

i=i 

Then $d_{TV}(Y_i, Y_j)$ < ^ for any $i, j \in [k]$, where $\sigma^2 = \max(\sigma_{-i}^2, \sigma_{-j}^2)$ and $\sigma_{-i}^2$ is the smallest
eigenvalue of $\Sigma_{-i}$.

By applying Lemma 4, we can make two discretized Gaussians have the same left out
coordinate and then merge them using Lemma 2 if at least one of them has large variance
in every direction. While each of the $k$ discretized Gaussians starts with this property (for
the dimensions in which it is non-deterministic), it is not clear whether this is true after a
sequence of pivot swaps and merges.

In many cases, swapping the pivot decreases the minimum eigenvalue of the distribution’s
covariance matrix by a factor of $\mathrm{poly}(k)$. This is acceptable if we only perform a single swap,
but naively applying this bound for a sequence of $k$ swaps and merges results in the minimum
eigenvalue dropping by a factor of $k^{O(k)}$. We show that such a bad situation cannot occur, no
matter how one performs the sequence of swaps and merges, by proving the following lemma:


Lemma 5 (Variance Swap Lemma). Let ..., be a sequence of symmetric 

positive-semidefinite matrices, and define = {j \ ejY^^^ej 0 } to he the set of coordinates 
in which is non-zero. Furthermore, let S = and S = Suppose the following 

hold for all i: 

1. has eigenvalue 0 with corresponding eigenvector 1 

2. There exists coordinate j* £ S^®^ such that has minimum eigenvalue at least A 

3 . (U£<i5W) n5(®) / 0 

Then, for all j £ S, the minimum eigenvalue is at least 

The details of this step, the proofs of Lemma 4 and Lemma 5, as well as the proof of Theorem 5,
are described in Section B.3.


4 Covers for PMDs 

In this section, we describe a pair of covers for $(n,k)$-PMDs.

The first cover follows directly from Theorem 5, which gives a structural characterization of
an $(n,k)$-Poisson multinomial random vector as the sum of an appropriately discretized Gaussian
and a $(tk^2, k)$-Poisson multinomial random vector. We grid over all possible mean vectors and
covariance matrices for the Gaussian component, and all possible parameter values for the
$(tk^2, k)$-PMD. The number of grid points for the Gaussian component is a power of
$n \cdot \mathrm{poly}(k/\varepsilon)$, and combining it with the number of parameter settings for the sparse
component gives the overall cover size stated in the following lemma.

Lemma 6. For all $n, k \in \mathbb{N}$, and all $\varepsilon > 0$, there exists an $\varepsilon$-cover of the set of all $(n,k)$-PMDs
whose size is

_ 2Poly(fc/£)_ 

The proof of this lemma is presented in Section C.1.

The second cover further sparsifies the cover for the $(tk^2, k)$-PMD component, by using a
multivariate generalization of the moment matching technique described in [DP14]. This reduces the
cover size for this component. In [Roo02], Roos shows that a PMD can be
written as the weighted sum of partial derivatives of a regular multinomial distribution. He goes
on to show that dropping the higher order derivatives in this sum results in a total variation
approximation, where the quality of the approximation depends on the parameters of the PMD and
the point at which we evaluate the derivatives. We take advantage of this tool to obtain an
$\varepsilon$-approximation, through a careful partitioning of the CRVs and choice of point at which to evaluate
the derivatives of the multinomial distributions. This implies that any two distributions which have
matching “moment profiles” (which roughly describe the lower order derivatives of the distribution)
are $\varepsilon$-close to each other, and thus only one representative element must be kept from each such
equivalence class. The size of the cover follows by a counting argument on the number of moment
profiles.

Lemma 7. For all $n, k \in \mathbb{N}$, and all $\varepsilon > 0$, there exists an $\varepsilon$-cover of the set of all $(n,k)$-PMDs
whose size is

The proof of this lemma is given in Section C.2. We note that this cover can be efficiently
enumerated over, using a dynamic program similar to that of [DP14].

By combining these two lemmas, we obtain Theorem 2.


5 Learning PMDs 

As mentioned before, Theorem 2 combined with Theorem 7 below (taken from [DK14]) immediately
implies that $(n,k)$-PMDs can be learned from $O(\log N/\varepsilon^2)$ samples, where $N$ is the size of our cover.

Theorem 7 (Theorem 19 of [DK14]). There is an algorithm FastTournament$(X, \mathcal{H}, \varepsilon, \delta)$, which is
given sample access to some distribution $X$ and a collection of distributions $\mathcal{H} = \{H_1, \dots, H_N\}$ over
some set $\mathcal{D}$, access to a PDF comparator for every pair of distributions $H_i, H_j \in \mathcal{H}$, an accuracy
parameter $\varepsilon > 0$, and a confidence parameter $\delta > 0$. The algorithm makes $O\left(\frac{\log(1/\delta)}{\varepsilon^2} \cdot \log N\right)$
draws from each of $X, H_1, \dots, H_N$ and returns some $H \in \mathcal{H}$ or declares “failure.” If there is
some $H^* \in \mathcal{H}$ such that $d_{TV}(H^*, X) \le \varepsilon$ then with probability at least $1 - \delta$ the distribution $H$
that FastTournament returns satisfies $d_{TV}(H, X) \le 512\varepsilon$. The total number of operations of the
algorithm is $O\left(\frac{\log(1/\delta)}{\varepsilon^2}\left(N \log N + \log^2 \frac{1}{\delta}\right)\right)$. Furthermore, the expected number of operations of the
algorithm is $O\left(\frac{N \log(N/\delta)}{\varepsilon^2}\right)$.
1. Guess the block structure/partition of the coordinates.

2. Estimate (using a single sample) the number of CRVs in each block.

3. For each Gaussian in the block structure, use $\mathrm{poly}(k)/\varepsilon^2$ samples to find its mean vector
and covariance matrix, as follows:

(a) With $\mathrm{poly}(k)/\varepsilon^2$ samples, estimate the mean vector and covariance matrix of the PMD.

(b) Convert these estimates to the mean and covariance of the Gaussian by searching over
a spectral cover of positive semidefinite matrices.

4. Guess the sparse component by enumerating over elements in either of the two covers.

5. Run a tournament on the set of guessed distributions to identify one which is $\varepsilon$-close.

Figure 1: Steps of the learning algorithm


Theorem 7 uses a tournament-style algorithm for hypothesis selection, which takes a set
of candidate distributions and outputs one which is $O(\varepsilon)$-close to the unknown distribution (if
such a distribution exists).¹ Given that $N$ is polynomial in $n$, the resulting sample complexity
is logarithmic in $n$. To remove the dependence on $n$ from our sample complexity, we need to
exploit not just the size but also the Gaussian structure of the cover. Instead of trying all possible
Gaussians that the cover could describe, we estimate the moments of the Gaussian directly.

Our strategy will not be to generate an $\varepsilon$-cover for all $(n,k)$-PMDs, but instead we take samples
and select only distributions from our cover which are consistent with the data. Similar to before,
we will apply Theorem 7 to do hypothesis selection, but instead of applying it to the complete cover
resulting from Theorem 2, we will apply it to a much smaller set of hypotheses that we obtain
after making several “guesses” for the parameters of our distribution. At least one set of these
parameters will be sufficiently accurate to obtain an $\varepsilon$ total variation distance guarantee, and we
will be able to determine a good candidate using Theorem 7.
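To make the idea of hypothesis selection concrete, here is a minimal Python sketch of a round-robin tournament built on pairwise Scheffé-style comparisons. It is not the FastTournament algorithm of [DK14]: it is a simpler quadratic-time stand-in, it assumes each hypothesis is a finite discrete distribution given as a dictionary from outcomes to probabilities, and it evaluates hypothesis probabilities exactly instead of through a PDF comparator and samples. The names `scheffe_winner` and `round_robin` are hypothetical.

```python
def scheffe_winner(h1, h2, samples):
    """Return 0 if h1 explains the samples better on the set {h1 > h2}, else 1."""
    support = set(h1) | set(h2)
    region = {x for x in support if h1.get(x, 0.0) > h2.get(x, 0.0)}
    mass1 = sum(h1.get(x, 0.0) for x in region)
    mass2 = sum(h2.get(x, 0.0) for x in region)
    # Empirical mass that the unknown distribution puts on the region.
    empirical = sum(1 for s in samples if s in region) / len(samples)
    return 0 if abs(mass1 - empirical) <= abs(mass2 - empirical) else 1


def round_robin(hypotheses, samples):
    """Return the index of the hypothesis that wins the most pairwise matches."""
    wins = [0] * len(hypotheses)
    for a in range(len(hypotheses)):
        for b in range(a + 1, len(hypotheses)):
            if scheffe_winner(hypotheses[a], hypotheses[b], samples) == 0:
                wins[a] += 1
            else:
                wins[b] += 1
    return max(range(len(hypotheses)), key=lambda i: wins[i])


candidates = [{0: 0.5, 1: 0.5}, {0: 0.9, 1: 0.1}, {0: 0.1, 1: 0.9}]
observed = [0] * 9 + [1]
best = round_robin(candidates, observed)
```

The point of the sketch is only that the selection step needs comparisons between candidates and a modest number of samples, not an enumeration of the unknown distribution itself.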

The first step of our learning algorithm is to guess the block-diagonal structure of the Gaussian
component of our distribution by guessing the partition of the coordinates and choosing an arbitrary
pivot within each block. The number of guesses this requires depends only on $k$. Note that any choice
of pivot in the partition is acceptable (as shown in Lemma 4 above).

The next step is to guess the sum of the means for the Gaussian component within each
block. We need this to know how to fill in the pivot coordinate once we have sampled the rest of the
coordinates in the block. This will be the number of CRVs which result in this block of the Gaussian
component, and thus an integer between 0 and $n$. Since the total variation distance between the
sampled distribution and the distribution from the cover is at most $\varepsilon$, with probability at least
$1 - \varepsilon$, the sample has non-zero probability to be generated by the distribution from the cover. In
this case, the sum of the sample’s values within each block will be equal to the sum of the means
from the Gaussian component, plus the contribution from the sparse $(tk^2, k)$-PMD component.
Therefore, for each block, we can guess the sum of the means via the following procedure: Take a
single sample $v$ and, for each block $B$, guess the sum of the means to be $\sum_{i \in B} v_i - j$ for
$j \in \{0, 1, \dots, tk^2\}$. Since there are at most $k$ blocks, this requires $(tk^2 + 1)^k$ guesses.
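The guessing procedure for the block sums can be sketched as follows. The sketch assumes that the sample is a vector of non-negative integer counts, that blocks are given as lists of coordinate indices, and that `sparse_size` plays the role of the size of the sparse component; negative guesses are dropped since the target is a non-negative integer. The function name `block_sum_guesses` is hypothetical.

```python
from itertools import product


def block_sum_guesses(sample, blocks, sparse_size):
    """Yield one tuple of guessed block sums per combination of offsets."""
    per_block = []
    for block in blocks:
        observed = sum(sample[i] for i in block)
        # Subtract every possible contribution j of the sparse component.
        options = [observed - j for j in range(sparse_size + 1) if observed - j >= 0]
        per_block.append(options)
    return product(*per_block)


guesses = sorted(block_sum_guesses([3, 1, 0, 2], [[0, 1], [2, 3]], sparse_size=1))
```

With two blocks and `sparse_size=1` this produces four candidate tuples, in line with the count of (size + 1) raised to the number of blocks when no guess is clipped at zero.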

¹We note that this tournament additionally requires a “PDF comparator,” which we describe for our setting in
Section D.4.

Next, we estimate the mean and covariance of the Gaussian component for each block. We need
to estimate them accurately enough in order to learn each block of the discretized Gaussians to
within $O(\varepsilon/k)$ in total variation distance. A useful tool for showing this is the following proposition:

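Proposition 1. Let $\mu, \mu' \in \mathbb{R}^k$ and $\Sigma, \Sigma' \in \mathbb{R}^{k \times k}$ such that for all $y \in \mathbb{R}^k$

$$|y^T(\mu' - \mu)| \le \varepsilon\sqrt{y^T \Sigma y} \quad \text{and} \quad |y^T(\Sigma' - \Sigma)y| \le \varepsilon\, y^T \Sigma y.$$

Then

$$d_{TV}(\mathcal{N}(\mu, \Sigma), \mathcal{N}(\mu', \Sigma')) \le 2\varepsilon k.$$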

Proposition 1 implies that, in order to achieve the required bound in total variation distance,
it suffices to get an estimate that approximately matches the mean and variance of the Gaussian
component in every direction. In Section D.1 we prove Lemma 8, which shows that using $\mathrm{poly}(k)/\varepsilon^2$
samples from the PMD, we can get an estimate of the mean and covariance matrix that achieves this
guarantee in every direction. However, this estimate is with respect to the PMD we are sampling
from and not with respect to the Gaussian component, which is the guarantee we desire.

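Lemma 8. Given sample access to an $(n,k)$-PMD $X$ with mean $\mu$ and covariance matrix $\Sigma$ (with
minimum eigenvalue at least 1), there exists an algorithm which can produce estimates $\hat\mu$ and $\hat\Sigma$
such that, with probability at least 9/10,

$$|y^T(\hat\mu - \mu)| \le \varepsilon\sqrt{y^T \Sigma y} \quad \text{and} \quad |y^T(\hat\Sigma - \Sigma)y| \le \varepsilon\, y^T \Sigma y$$

for all vectors $y$.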

The sample and time complexity are Oiff^ 
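As a point of reference, the plain empirical mean and covariance can be computed as in the Python sketch below. This is the textbook estimator and is shown only to fix ideas; it is an assumption of this sketch, not a statement about which estimator the lemma analyzes. The function name `empirical_mean_cov` is hypothetical.

```python
def empirical_mean_cov(samples):
    """Return the sample mean vector and the unbiased sample covariance matrix."""
    m = len(samples)
    k = len(samples[0])
    mean = [sum(s[j] for s in samples) / m for j in range(k)]
    cov = [
        [
            sum((s[a] - mean[a]) * (s[b] - mean[b]) for s in samples) / (m - 1)
            for b in range(k)
        ]
        for a in range(k)
    ]
    return mean, cov


# Four draws of a single 2-dimensional CRV: the two coordinates are
# perfectly anti-correlated because they always sum to 1.
mean_hat, cov_hat = empirical_mean_cov([[1, 0], [0, 1], [1, 0], [0, 1]])
```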

In order to obtain a guarantee for the Gaussian component, we observe that there are two 
possible sources of errors in our estimation: 

• The first source of error comes from the rounding step. In proving our structural result, the
real PMD had to be rounded so that no CRV has any probability that is in the range $(0, c)$,
which affected the mean and covariance. In Section D.2, we show that this only affects the
mean and variance in each direction up to a small multiplicative factor.

• The second source of error is the sparse component, which creates an additional
additive error in each direction. This error might be very significant in some directions, as
the variance of the Gaussian component can be very small compared to the number of sparse
CRVs.


Understanding that our estimation is off by an additive error and a multiplicative error, we show how
to efficiently correct this estimation by searching around it for the underlying covariance matrix
of the Gaussian distribution. In particular, we obtain a cover of positive semidefinite matrices
that are close to the estimated covariance matrix and which contains a good approximation to the
covariance matrix of the underlying Gaussian. This is challenging because the above two sources
of error might affect the spectrum of the covariance matrix significantly. However, we are able to
tackle this issue by carefully guessing appropriate corrections to the eigenvectors and eigenvalues
of the matrix. We prove Lemma 9, which bounds the cardinality of this cover,
and thus we can get a very accurate estimate for the underlying Gaussian distribution by guessing
different points in the cover.


Lemma 9. Let $A$ be a symmetric $k \times k$ PSD matrix with minimum eigenvalue 1 and let $S$ be the
set of all matrices $B$ such that $|y^T(A - B)y| \le \varepsilon_1 y^T A y + \varepsilon_2 y^T y$ for all vectors $y$, where $\varepsilon_1 \in [0, 1/4)$

and £2 G [0,oo). Then, there exists an £-cover S^- of S that has size ISgl < ( j 
At this point, we have a collection of distributions such that at least one is close to the Gaussian
component. We do the same for the sparse PMD component by simply enumerating over all the
elements in the cover. The number of guesses this requires is the corresponding term in the
statement of Theorem 2.

In conclusion, using $\mathrm{poly}(k)/\varepsilon^2$ samples, we have generated a set $\mathcal{S}$ of candidate distributions
which contains a distribution which is $\varepsilon$-close to the true distribution with constant probability. In
order to choose a “good” distribution from this set, we apply the hypothesis selection algorithm of
Theorem 7 to obtain a distribution which is $O(\varepsilon)$-close to the unknown distribution with constant
probability, which concludes the proof of Theorem 3. More details about the learning steps and
complete proofs can be found in Section D.

6 Learning k-SIIRVs

We demonstrate the expressive power of PMDs by showing their applicability to learning
$(n,k)$-SIIRVs. In particular, we leverage our cover results to give an $O_k(1/\varepsilon^2)$ sample algorithm for
this problem.

The proof uses the structural result of [DDO+13], which says that any $(n,k)$-SIIRV is close to
either a low variance distribution with limited support, or a high variance distribution which enjoys
certain Gaussian structural properties.

Lemma 10 (Corollary 4.8 of [DDO+13]). Let $S = X_1 + \dots + X_n$ be an $(n,k)$-SIIRV for some
positive integer $k$. Let $\mu$ and $\sigma^2$ be respectively the mean and variance of $S$. Then for all $\varepsilon > 0$,
the distribution of $S$ is $O(\varepsilon)$-close in total variation distance to one of the following:

1 . a random variable supported on ^ consecutive integers with variance < 15(A;^®/e®) log^(l/e); 
or 

2. the sum of two independent random variables $S_1 + cS_2$, where $c$ is some positive integer
$1 \le c \le k - 1$, $S_2$ is distributed according to a discretized Gaussian, and $S_1$ is a $c$-IRV; in this case,
a2 = o(^log2(l/e)). 

As we did for PMDs, we will use the tournament-based approach, in which we generate a set
of probability distributions $\mathcal{S}$, containing at least one distribution which is $\varepsilon$-close to $S$. We then
use Theorem 7 to select a distribution which is $O(\varepsilon)$-close to $S$, using $O(|\mathcal{S}|/\varepsilon^2)$ samples.

To cover the former case, we use the PMD cover of Theorem 2. In this setting, the SIIRV has
a variance upper bounded by $\mathrm{poly}(k/\varepsilon)$. By applying a rounding procedure, it can be shown that
this can be approximated by an offset $(\mathrm{poly}(k/\varepsilon), k)$-SIIRV. Recalling that any $(n,k)$-SIIRV can
be expressed as the projection of an $(n,k)$-PMD onto the vector $(0, 1, \dots, k-1)$, applying our
quasi-polynomial cover result in Theorem 2 covers this case.

To cover the latter case, we first perform $k - 1$ guesses for the value of $c \in [k - 1]$. For
each guess, we learn the two distributions $S_1$ and $S_2$ separately. To learn $S_1$, we use the same
approach as [DDO+13], which uses the empirical distribution obtained after mapping the samples
onto $\{0, 1, \dots, c - 1\}$ using their residue mod $c$. Our method for learning $S_2$ is novel: we first
round the value of each sample down to the next multiple of $c$, and examine the distribution on
this support, which will be close in total variation distance to $S_2$. We estimate the moments of
this distribution using robust statistical tools, as in [DDO+13]. The empirical median is used to
estimate the mean, and a rescaling of the interquartile range is used to estimate the standard
deviation. Thus, we cover this case using only $k - 1$ candidates, one for each guess of $c$.
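The two estimation steps for a fixed guess of $c$ can be sketched in Python as follows. The sketch makes several assumptions that are not taken from the text: samples are integers, the rounded samples are divided by $c$ before estimating moments, and the interquartile range is rescaled by the Gaussian constant $2\Phi^{-1}(0.75) \approx 1.349$. The function names are hypothetical.

```python
from collections import Counter
from statistics import NormalDist, median, quantiles


def residue_distribution(samples, c):
    """Empirical distribution of the samples reduced mod c."""
    counts = Counter(s % c for s in samples)
    return {r: counts.get(r, 0) / len(samples) for r in range(c)}


def robust_moments(samples, c):
    """Median and rescaled interquartile range of the samples rounded down to multiples of c."""
    # s // c is the rounded-down multiple (s - s % c) divided by c.
    scaled = [s // c for s in samples]
    q1, _, q3 = quantiles(scaled, n=4)
    # For a Gaussian, the interquartile range equals 2 * Phi^{-1}(0.75) standard deviations.
    iqr_to_sigma = 2 * NormalDist().inv_cdf(0.75)
    return median(scaled), (q3 - q1) / iqr_to_sigma


data = list(range(100))
residues = residue_distribution(data, 5)
center, spread = robust_moments(data, 5)
```

The median and interquartile range are used instead of the empirical mean and standard deviation because they are not thrown off by a small fraction of outlying samples.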

Full details are provided in Section E.
A Useful Tools

A.1 Probability Metrics

To compare probability distributions, we will require the total variation and Kolmogorov distances: 


Definition 7. The total variation distance between two probability measures $P$ and $Q$ on a
$\sigma$-algebra $F$ is defined by

$$d_{TV}(P, Q) = \sup_{A \in F} |P(A) - Q(A)| = \frac{1}{2}\|P - Q\|_1.$$

Unless explicitly stated otherwise, in this paper, when two distributions are said to be e-close, 
we mean in total variation distance. 

Definition 8. The Kolmogorov distance between two probability measures $P$ and $Q$ with CDFs $F_P$
and $F_Q$ is defined by

$$d_K(P, Q) = \sup_{x \in \mathbb{R}} |F_P(x) - F_Q(x)|.$$

We note that Kolmogorov distance is, in general, weaker than total variation distance. In 
particular, total variation distance between two distributions is lower bounded by the Kolmogorov 
distance. 

Fact 1. $d_K(P, Q) \le d_{TV}(P, Q)$.
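For finitely supported distributions on the integers, both distances are easy to compute directly. The Python sketch below assumes each distribution is a dictionary from integer outcomes to probabilities; the function names are hypothetical.

```python
def tv_distance(p, q):
    """Half the L1 distance between two finitely supported distributions."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)


def kolmogorov_distance(p, q):
    """Largest gap between the two CDFs, checked at every support point."""
    cdf_p = cdf_q = 0.0
    worst = 0.0
    for x in sorted(set(p) | set(q)):
        cdf_p += p.get(x, 0.0)
        cdf_q += q.get(x, 0.0)
        worst = max(worst, abs(cdf_p - cdf_q))
    return worst


p = {0: 0.5, 1: 0.5}
q = {0: 0.25, 1: 0.25, 2: 0.5}
tv = tv_distance(p, q)
dk = kolmogorov_distance(p, q)
```

In this example the two distances coincide; in general the Kolmogorov distance can be strictly smaller, since it only looks at events of the form $(-\infty, x]$.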


A.2 Probabilistic Tools 

We will use the following form of Chernoff/Hoeffding bounds: 

Lemma 11 (Chernoff/Hoeffding). Let $Z_1, \dots, Z_m$ be independent random variables with $Z_i \in [0, 1]$
for all $i$. Then, if $Z = \sum_{i=1}^m Z_i$ and $\gamma \in (0, 1)$,

$$\Pr[|Z - E[Z]| \ge \gamma E[Z]] \le 2\exp(-\gamma^2 E[Z]/3).$$


We note the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality, which is a powerful tool, giving a
generic algorithm for learning any distribution with respect to the Kolmogorov metric [DKW56].

Lemma 12 ([DKW56], [Mas90]). Suppose we have $n$ IID samples $X_1, \dots, X_n$ from a probability
distribution with CDF $F$. Let $F_n(x) = \frac{1}{n}\sum_{i=1}^n 1_{\{X_i \le x\}}$ be the empirical CDF. Then
$\Pr[d_K(F, F_n) \ge \varepsilon] \le 2e^{-2n\varepsilon^2}$. In particular, if $n = \Omega((1/\varepsilon^2) \cdot \log(1/\delta))$, then $\Pr[d_K(F, F_n) \ge \varepsilon] \le \delta$.
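The DKW inequality translates directly into a sample-size rule. The Python sketch below assumes the tail bound $2e^{-2n\varepsilon^2}$ and solves it for $n$; the function names are hypothetical.

```python
import bisect
import math


def dkw_sample_size(eps, delta):
    """Smallest n with 2 * exp(-2 * n * eps^2) <= delta."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))


def empirical_cdf(samples):
    """Return a function x -> fraction of samples that are <= x."""
    ordered = sorted(samples)
    n = len(ordered)

    def cdf(x):
        return bisect.bisect_right(ordered, x) / n

    return cdf


needed = dkw_sample_size(eps=0.1, delta=0.05)
cdf = empirical_cdf([3, 1, 2, 2])
```

Note that the required number of samples does not depend on the support size of the distribution, which is what makes the inequality a generic learning tool for the Kolmogorov metric.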


We will use the Data Processing Inequality for total variation distance (see part (iv) of Lemma
2 of [Rey11] for the proof). This lemma says that taking any function of two random variables can
only reduce their total variation distance. Our statement of the inequality is taken from [DDO+13].

Lemma 13 (Data Processing Inequality for Total Variation Distance). Let $X, X'$ be two random
variables over a domain $\Omega$. Fix any (possibly randomized) function $F$ on $\Omega$ (which may be viewed as
a distribution over deterministic functions on $\Omega$) and let $F(X)$ be the random variable such that a
draw from $F(X)$ is obtained by drawing independently $x$ from $X$ and $f$ from $F$ and then outputting
$f(x)$ (likewise for $F(X')$). Then we have

$$d_{TV}(F(X), F(X')) \le d_{TV}(X, X').$$

Finally, we require a hypothesis selection algorithm. Roughly, given a set of N distributions 
with the guarantee that at least one is e-close to an unknown distribution X, we can choose a 
hypothesis which is 0(e)-close to X. The running time is near-linear in N and the number of 
samples is logarithmic in N. 

Definition 9. Let $H_1$ and $H_2$ be probability distributions over some set $\mathcal{D}$. A PDF comparator
for $H_1, H_2$ is an oracle that takes as input some $x \in \mathcal{D}$ and outputs 1 if $H_1(x) > H_2(x)$, and 0
otherwise.

Theorem 7 (Theorem 19 of [DK14]). There is an algorithm FastTournament$(X, \mathcal{H}, \varepsilon, \delta)$, which is
given sample access to some distribution $X$ and a collection of distributions $\mathcal{H} = \{H_1, \dots, H_N\}$ over
some set $\mathcal{D}$, access to a PDF comparator for every pair of distributions $H_i, H_j \in \mathcal{H}$, an accuracy
parameter $\varepsilon > 0$, and a confidence parameter $\delta > 0$. The algorithm makes $O\left(\frac{\log(1/\delta)}{\varepsilon^2} \cdot \log N\right)$
draws from each of $X, H_1, \dots, H_N$ and returns some $H \in \mathcal{H}$ or declares “failure.” If there is
some $H^* \in \mathcal{H}$ such that $d_{TV}(H^*, X) \le \varepsilon$ then with probability at least $1 - \delta$ the distribution $H$
that FastTournament returns satisfies $d_{TV}(H, X) \le 512\varepsilon$. The total number of operations of the
algorithm is $O\left(\frac{\log(1/\delta)}{\varepsilon^2}\left(N \log N + \log^2 \frac{1}{\delta}\right)\right)$. Furthermore, the expected number of operations of the
algorithm is $O\left(\frac{N \log(N/\delta)}{\varepsilon^2}\right)$.


A.3 Bounds for Distances Between Distributions 

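Proposition 2 (Proposition B.4 of [DDO+13]). Let $\mu_1, \mu_2 \in \mathbb{R}$ and $0 < \sigma_1 \le \sigma_2$. Then

$$d_{TV}(\mathcal{N}(\mu_1, \sigma_1^2), \mathcal{N}(\mu_2, \sigma_2^2)) \le \frac{1}{2}\left(\frac{|\mu_1 - \mu_2|}{\sigma_1} + \frac{\sigma_2^2 - \sigma_1^2}{\sigma_1^2}\right).$$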
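Proposition 3 (Proposition 32 in [VV10]). Given two $k$-dimensional Gaussians $\mathcal{N}_1 = \mathcal{N}(\mu_1, \Sigma_1)$, $\mathcal{N}_2 =
\mathcal{N}(\mu_2, \Sigma_2)$ such that for all $i, j \in [k]$, $|\Sigma_1(i,j) - \Sigma_2(i,j)| \le \alpha$, and the minimum eigenvalue of $\Sigma_1$
is at least $\sigma^2$,

$$d_{TV}(\mathcal{N}_1, \mathcal{N}_2) \le \frac{\|\mu_1 - \mu_2\|}{\sqrt{2\pi}\,\sigma} + \frac{k\alpha}{\sqrt{2\pi e}\,(\sigma^2 - \alpha)}.$$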


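Proposition 1. Let $\mu, \mu' \in \mathbb{R}^k$ and $\Sigma, \Sigma' \in \mathbb{R}^{k \times k}$, such that for all $y \in \mathbb{R}^k$

$$|y^T(\mu' - \mu)| \le \varepsilon\sqrt{y^T \Sigma y} \quad \text{and} \quad |y^T(\Sigma' - \Sigma)y| \le \varepsilon\, y^T \Sigma y.$$

Then

$$d_{TV}(\mathcal{N}(\mu, \Sigma), \mathcal{N}(\mu', \Sigma')) \le 2\varepsilon k.$$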


Proof. Without loss of generality, assume $\mu = 0$, $\Sigma = I$, and $\Sigma'$ is diagonal. This can be done by
setting y = QA~^/‘^Q'^x, where S = QAQ^ and S' = Q'A'Q'^ are the eigendecompositions of S 
and S'. 
This implies that we now have the following guarantees for all $i \in [k]$:

$$|\mu'_i| \le \varepsilon \quad \text{and} \quad |\Sigma'_{ii} - 1| \le \varepsilon.$$

Since each coordinate is independent and noting that $\Sigma'_{ii} \ge 1 - \varepsilon$, we can apply Proposition 2
to each coordinate direction to obtain a total variation distance of $2\varepsilon k$. □

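Proposition 4 (Berry-Esseen theorem [Ber41], [Ess42], [She10]). Let $X_1, \dots, X_n$ be independent random
variables, with $E[X_i] = 0$, $E[X_i^2] = \sigma_i^2 > 0$, $E[|X_i|^3] = \rho_i < \infty$, and define $X = \sum_{i=1}^n X_i$,
$\sigma^2 = \sum_{i=1}^n \sigma_i^2$, and $\rho = \sum_{i=1}^n \rho_i$. Then for an absolute constant $C_0 \le 0.56$,

$$d_K(X, \mathcal{N}(0, \sigma^2)) \le \frac{C_0\, \rho}{\sigma^3}.$$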
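As a quick worked instance, the Python sketch below evaluates the Berry-Esseen bound for a sum of $n$ centered Bernoulli($p$) variables, assuming the bound has the form $C_0 \rho / \sigma^3$ with $C_0 = 0.56$. For a centered Bernoulli($p$) variable the variance is $p(1-p)$ and the third absolute moment is $p(1-p)(p^2 + (1-p)^2)$. The function name is hypothetical.

```python
def berry_esseen_bernoulli(n, p, c0=0.56):
    """Berry-Esseen bound c0 * rho / sigma^3 for a sum of n centered Bernoulli(p)."""
    variance_each = p * (1 - p)
    third_moment_each = p * (1 - p) * (p ** 2 + (1 - p) ** 2)
    sigma_sq = n * variance_each
    rho = n * third_moment_each
    return c0 * rho / sigma_sq ** 1.5


bound_100 = berry_esseen_bernoulli(100, 0.5)
bound_10000 = berry_esseen_bernoulli(10000, 0.5)
```

The bound decays like $1/\sqrt{n}$ here: multiplying $n$ by 100 divides it by 10.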

A.4 Covariance Matrices of Truncated Categorical Random Variables 

First, recall the definition of a symmetric diagonally dominant matrix. 

Definition 10. A matrix $A$ is symmetric diagonally dominant (SDD) if $A^T = A$ and $A_{ii} \ge
\sum_{j \ne i} |A_{ij}|$ for all $i$.

As a tool, we will use this corollary of the Gershgorin Circle Theorem [Ger31], which follows
since all eigenvalues of a symmetric matrix are real.

Proposition 5. Given an SDD matrix $A$ with positive diagonal entries, the minimum eigenvalue
of $A$ is at least $\min_i \left(A_{ii} - \sum_{j \ne i} |A_{ij}|\right)$.

Proposition 6. The minimum eigenvalue of the covariance matrix $\Sigma$ of a truncated CRV is at
least $\rho(0) \min_i \rho(i)$.

Proof. The entries of the covariance matrix are

$$\Sigma_{ij} = E[x_i x_j] - E[x_i]E[x_j] = \begin{cases} \rho(i) - \rho(i)^2 & \text{if } i = j \\ -\rho(i)\rho(j) & \text{else} \end{cases}$$

We note that $\Sigma$ is SDD, since $\sum_{j \ne i} |\Sigma_{ij}| = \rho(i) \sum_{j \ne i} \rho(j) = \rho(i)(1 - \rho(i) - \rho(0)) \le \rho(i)(1 -
\rho(i)) = \Sigma_{ii}$. Thus, applying Proposition 5, we see that the minimum eigenvalue of $\Sigma$ is at least
$\min_i \left[\rho(i)(1 - \rho(i)) - \rho(i)(1 - \rho(i) - \rho(0))\right] = \rho(0) \min_i \rho(i)$. □
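The calculation in this proof can be checked numerically. The Python sketch below builds the covariance matrix of a truncated CRV, assuming the parameter vector is given with the left-out coordinate in position 0, and evaluates the Gershgorin lower bound on its minimum eigenvalue. The function names are hypothetical.

```python
def truncated_crv_covariance(p):
    """Covariance of coordinates 1..k-1 of a CRV with parameters p; p[0] is left out."""
    q = p[1:]
    size = len(q)
    return [
        [q[i] - q[i] ** 2 if i == j else -q[i] * q[j] for j in range(size)]
        for i in range(size)
    ]


def gershgorin_lower_bound(a):
    """min_i (a_ii - sum_{j != i} |a_ij|), a lower bound on the eigenvalues of a symmetric a."""
    size = len(a)
    return min(
        a[i][i] - sum(abs(a[i][j]) for j in range(size) if j != i)
        for i in range(size)
    )


params = [0.5, 0.2, 0.3]
cov = truncated_crv_covariance(params)
lower = gershgorin_lower_bound(cov)
```

For `params = [0.5, 0.2, 0.3]` the row bounds are $0.2 \cdot 0.5$ and $0.3 \cdot 0.5$, so the lower bound equals $\rho(0)\min_i \rho(i) = 0.1$.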

A.5 Sums of Discretized Gaussians 

In this section, we will obtain total variation distance bounds on merging the sum of discretized 
Gaussians. It is well known that the sum of multiple Gaussians has the same distribution as a 
single Gaussian with parameters equal to the sum of the components’ parameters. However, this 
is not true if we are summing discretized Gaussians - we quantify the amount we lose by replacing 
the distribution with a single Gaussian, and then discretizing afterwards. 

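As a tool, we will use the following result from [DDO+13]:

Proposition 7 (Proposition B.5 in [DDO+13]). Let $X \sim \mathcal{N}(\mu, \sigma^2)$ and $\lambda \in \mathbb{R}$. Then

$$d_{TV}(\lfloor X + \lambda \rceil, \lfloor X \rceil + \lfloor \lambda \rceil) \le \frac{1}{2\sigma}.$$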
From this, we can obtain the following:

Proposition 8. Let $X_1 \sim \mathcal{N}(\mu_1, \sigma_1^2)$ and $X_2 \sim \mathcal{N}(\mu_2, \sigma_2^2)$. Then

$$d_{TV}(\lfloor X_1 + X_2 \rceil, \lfloor X_1 \rceil + \lfloor X_2 \rceil) \le \frac{1}{2\sigma},$$

where $\sigma = \max_i \sigma_i$.

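Proof. First, suppose without loss of generality that $\sigma_1 \ge \sigma_2$.

$$\begin{aligned}
d_{TV}(\lfloor X_1 + X_2 \rceil, \lfloor X_1 \rceil + \lfloor X_2 \rceil)
&= \frac{1}{2} \sum_{i=-\infty}^{\infty} \left|\Pr(\lfloor X_1 + X_2 \rceil = i) - \Pr(\lfloor X_1 \rceil + \lfloor X_2 \rceil = i)\right| \\
&= \frac{1}{2} \sum_{i=-\infty}^{\infty} \left|\int_{-\infty}^{\infty} f_{X_2}(\lambda) \Pr(\lfloor X_1 + \lambda \rceil = i)\, d\lambda - \int_{-\infty}^{\infty} f_{X_2}(\lambda) \Pr(\lfloor X_1 \rceil + \lfloor \lambda \rceil = i)\, d\lambda\right| \\
&= \frac{1}{2} \sum_{i=-\infty}^{\infty} \left|\int_{-\infty}^{\infty} f_{X_2}(\lambda) \left(\Pr(\lfloor X_1 + \lambda \rceil = i) - \Pr(\lfloor X_1 \rceil + \lfloor \lambda \rceil = i)\right) d\lambda\right| \\
&\le \frac{1}{2} \sum_{i=-\infty}^{\infty} \int_{-\infty}^{\infty} f_{X_2}(\lambda) \left|\Pr(\lfloor X_1 + \lambda \rceil = i) - \Pr(\lfloor X_1 \rceil + \lfloor \lambda \rceil = i)\right| d\lambda \\
&= \int_{-\infty}^{\infty} f_{X_2}(\lambda) \left(\frac{1}{2} \sum_{i=-\infty}^{\infty} \left|\Pr(\lfloor X_1 + \lambda \rceil = i) - \Pr(\lfloor X_1 \rceil + \lfloor \lambda \rceil = i)\right|\right) d\lambda \\
&\le \int_{-\infty}^{\infty} f_{X_2}(\lambda) \frac{1}{2\sigma_1}\, d\lambda \\
&= \frac{1}{2\sigma_1}.
\end{aligned}$$

The second inequality uses Proposition 7. □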
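The size of the loss incurred by rounding separately can be checked numerically in one dimension. The Python sketch below assumes that $\lfloor \cdot \rceil$ denotes rounding to the nearest integer and that the two Gaussians are independent; it truncates both supports at 12 standard deviations, so the result is a close approximation and not an exact value. The function names are hypothetical.

```python
from statistics import NormalDist


def rounded_pmf(mu, sigma, lo, hi):
    """PMF of a Gaussian rounded to the nearest integer, restricted to lo..hi."""
    dist = NormalDist(mu, sigma)
    return {i: dist.cdf(i + 0.5) - dist.cdf(i - 0.5) for i in range(lo, hi + 1)}


def rounding_gap(mu1, s1, mu2, s2, width=12):
    """Approximate TV distance between round(X1 + X2) and round(X1) + round(X2)."""
    lo1, hi1 = int(mu1 - width * s1) - 1, int(mu1 + width * s1) + 1
    lo2, hi2 = int(mu2 - width * s2) - 1, int(mu2 + width * s2) + 1
    p1 = rounded_pmf(mu1, s1, lo1, hi1)
    p2 = rounded_pmf(mu2, s2, lo2, hi2)
    # Distribution of round(X1) + round(X2): convolution of the two rounded PMFs.
    separate = {}
    for a, pa in p1.items():
        for b, pb in p2.items():
            separate[a + b] = separate.get(a + b, 0.0) + pa * pb
    # Distribution of round(X1 + X2): a single rounded Gaussian with summed parameters.
    joint = rounded_pmf(mu1 + mu2, (s1 ** 2 + s2 ** 2) ** 0.5, lo1 + lo2, hi1 + hi2)
    return 0.5 * sum(abs(separate.get(i, 0.0) - joint[i]) for i in joint)


gap = rounding_gap(0.3, 2.0, 1.7, 1.0)
```

For these parameters the proposition gives the bound $1/(2 \cdot 2.0) = 0.25$, and the computed gap can be compared against it.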


This leads to the following lemma: 

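Lemma 2. Let $X_1 \sim \mathcal{N}(\mu_1, \Sigma_1)$ and $X_2 \sim \mathcal{N}(\mu_2, \Sigma_2)$ be $k$-dimensional Gaussian random variables,
and let $\sigma = \min_j \max_i \sigma_{i,j}$, where $\sigma_{i,j}$ is the standard deviation of $X_i$ in the direction parallel to the
$j$th coordinate axis. Then

$$d_{TV}\left(\lfloor X_1 + X_2 \rceil, \lfloor X_1 \rceil + \lfloor X_2 \rceil\right) \le \frac{k}{2\sigma}.$$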

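Proof. The proof is by induction on $k$. The base case of $k = 1$ is handled by Proposition 8. For
general $k$, we use a standard hybridization argument. Denote the $j$th coordinate of $X_i$ as $x_{ij}$.

$$\begin{aligned}
d_{TV}&\left(\lfloor X_1 + X_2 \rceil, \lfloor X_1 \rceil + \lfloor X_2 \rceil\right) \\
&= d_{TV}\left((\lfloor x_{11} + x_{21} \rceil, \dots, \lfloor x_{1k} + x_{2k} \rceil), (\lfloor x_{11} \rceil + \lfloor x_{21} \rceil, \dots, \lfloor x_{1k} \rceil + \lfloor x_{2k} \rceil)\right) \\
&\le d_{TV}\left((\lfloor x_{11} + x_{21} \rceil, \dots, \lfloor x_{1k} + x_{2k} \rceil), (\lfloor x_{11} \rceil + \lfloor x_{21} \rceil, \dots, \lfloor x_{1(k-1)} \rceil + \lfloor x_{2(k-1)} \rceil, \lfloor x_{1k} + x_{2k} \rceil)\right) \\
&\quad + d_{TV}\left((\lfloor x_{11} \rceil + \lfloor x_{21} \rceil, \dots, \lfloor x_{1(k-1)} \rceil + \lfloor x_{2(k-1)} \rceil, \lfloor x_{1k} + x_{2k} \rceil), (\lfloor x_{11} \rceil + \lfloor x_{21} \rceil, \dots, \lfloor x_{1k} \rceil + \lfloor x_{2k} \rceil)\right) \\
&\le d_{TV}\left((\lfloor x_{11} + x_{21} \rceil, \dots, \lfloor x_{1(k-1)} + x_{2(k-1)} \rceil), (\lfloor x_{11} \rceil + \lfloor x_{21} \rceil, \dots, \lfloor x_{1(k-1)} \rceil + \lfloor x_{2(k-1)} \rceil)\right) \\
&\quad + d_{TV}\left(\lfloor x_{1k} + x_{2k} \rceil, \lfloor x_{1k} \rceil + \lfloor x_{2k} \rceil\right) \\
&\le \frac{k-1}{2\sigma} + \frac{1}{2\sigma} = \frac{k}{2\sigma}.
\end{aligned}$$

The first inequality is the triangle inequality, the second uses Lemma 13, and the third uses the
induction hypothesis and Proposition 8. □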


B Details from Section 3


B.1 Rounding the Parameters


Fix some coordinate $x$, and select all $k$-CRVs where the parameter in coordinate $x$ is in the range
$(0, c)$. Partition this subset into $k - 1$ sets, depending on which coordinate $y \ne x$ is the heaviest.
We apply a rounding procedure separately to each of these sets. After this procedure, none of the
parameters in coordinate $x$ will be in $(0, c)$. We repeat this for all $k$ possible settings of $x$. From the
description below (and the restriction on $c$), it will be clear that we will not “undo” any of
our work and move probabilities back into $(0, c)$, so $O(k^2)$ applications of our rounding procedure
will produce the result claimed in the theorem statement.

Recall that the goal of this rounding procedure will be to shift probability mass either to or 
from coordinate x to coordinate y, such that no parameter in coordinate x lies in the interval ( 0 , c), 
while simultaneously approximately preserving the mean vector of the distribution. We are able 
to do this since coordinate y is “heavy” and thus small additions will not affect the distribution in 
this coordinate much. 

We fix some $x, y$ in order to describe and analyze the process more formally. Define $\mathcal{I}_y^x = \{i \mid 0 <
\rho(i, x) < c \wedge y = \arg\max_j \rho(i, j)\}$ (breaking ties lexicographically), and let $M^{\rho_{\mathcal{I}_y^x}}$ be the PMD
induced by this set. For the remainder of this section, without loss of generality, assume that the
indices selected by $\mathcal{I}_y^x$ are 1 through $|\mathcal{I}_y^x|$.


Select an arbitrary set IZ XXy such that \R\ 


Ply b 
c 


Intuitively, this set will be 


the CRVs for which we set the parameter p{-,x) to be c, while Xy\TZ will have / 5 (-,x) set to 0. We 
can perform the following rounding scheme to pi^ to obtain a new parameter matrix pjx : 


c if j = xAi^TZ 

0 a j = X Ai ^TZ 
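
As a concrete illustration of the rounding scheme, the following is a minimal sketch using exact rational arithmetic. The function name `round_coordinate`, the choice of $\mathcal{R}$ as the first rows, and the floor in $|\mathcal{R}|$ are assumptions made for this example; the text only requires $\mathcal{R}$ to be an arbitrary set of the stated size.

```python
from fractions import Fraction
from math import floor


def round_coordinate(rho, x, y, c):
    """Round column x of rho to values in {0, c}, moving the difference into column y.

    rho is a list of rows (one k-CRV per row); every row is assumed to have
    0 < rho[i][x] < c and heaviest coordinate y.
    """
    total_x = sum(row[x] for row in rho)
    num_raised = floor(total_x / c)  # |R|: rows whose x-parameter becomes c
    rounded = []
    for i, row in enumerate(rho):
        new_row = list(row)
        # R is arbitrary; this sketch simply takes the first num_raised rows.
        new_row[x] = c if i < num_raised else Fraction(0)
        # Coordinate y absorbs the change, so each row still sums to 1.
        new_row[y] = row[x] + row[y] - new_row[x]
        rounded.append(new_row)
    return rounded


c = Fraction(1, 10)
rho = [
    [Fraction(1, 20), Fraction(1, 4), Fraction(7, 10)],
    [Fraction(3, 50), Fraction(1, 5), Fraction(37, 50)],
    [Fraction(2, 25), Fraction(3, 20), Fraction(77, 100)],
]
rho_hat = round_coordinate(rho, x=0, y=2, c=c)
```

In this toy instance the total mass in coordinate $x$ changes by less than $c$, which is the sense in which the mean vector is approximately preserved.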

We define the process Fork, for sampling from a $k$-CRV $\rho(i, \cdot)$ in $\mathcal{I}_y^x$:

• Let $X_i$ be an indicator random variable, taking 1 with probability $\frac{1}{k}$ and 0 otherwise.

• If $X_i = 1$, then return $e_x$ with probability $k\rho(i, x)$ and $e_y$ with probability $1 - k\rho(i, x)$.

• If $X_i = 0$, then return $e_j$ with probability 0 if $j = x$, $\frac{k}{k-1}\left(\rho(i, x) + \rho(i, y) - \frac{1}{k}\right)$ if $j = y$, and
$\frac{k}{k-1}\rho(i, j)$ otherwise.

The intuition behind this procedure is that we isolate the changes in our rounding procedure
when $X_i = 1$, as when $X_i = 0$, the rounded and unrounded distributions are identical. We note that
Fork is well defined as long as $\rho(i, x) \le \frac{1}{k}$ and $\rho(i, x) + \rho(i, y) \ge \frac{1}{k}$. The former is true since
$\rho(i, x) < c \le \frac{1}{k}$, and the latter is true since $y$ was chosen to be the heaviest coordinate. Additionally, by
calculating the probability of any outcome, we can see that Fork is equivalent to the regular sampling process.
Define the (random) set $\mathcal{X} = \{i \mid X_i = 1\}$. We will use $\theta$ to refer to a particular realization of this
set. We define Fork for sampling from $\hat{\rho}(i, \cdot)$ in the same way, though we will denote the indicator
random variables by $\hat{X}_i$ and $\hat{\mathcal{X}}$ instead. Note that, if $c \le \frac{1}{k}$, the process will still be well defined
after rounding. This is because $\hat{\rho}(i, x) \le c \le \frac{1}{k}$, and $\hat{\rho}(i, x) + \hat{\rho}(i, y) = \rho(i, x) + \rho(i, y) \ge \frac{1}{k}$. For the
rest of this section, when we are drawing a sample from a CRV, we draw it via the process Fork.
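
The claim that Fork is equivalent to the regular sampling process can be checked by computing the exact output distribution. The sketch below does this with rational arithmetic; `fork_distribution` is a name chosen for this example, and the branch probabilities follow the reading of the process in which the $X_i = 0$ branch is rescaled by $\frac{k}{k-1}$.

```python
from fractions import Fraction


def fork_distribution(p, x, y):
    """Exact output distribution of the Fork process for a single k-CRV p."""
    k = len(p)
    inv_k = Fraction(1, k)
    # Branch X_i = 1 (probability 1/k): only e_x or e_y can be returned.
    branch1 = [Fraction(0)] * k
    branch1[x] = k * p[x]
    branch1[y] = 1 - k * p[x]
    # Branch X_i = 0 (probability 1 - 1/k): e_x is never returned.
    scale = Fraction(k, k - 1)
    branch0 = [scale * pj for pj in p]
    branch0[x] = Fraction(0)
    branch0[y] = scale * (p[x] + p[y] - inv_k)
    # Mix the two branches according to the indicator X_i.
    return [inv_k * b1 + (1 - inv_k) * b0 for b1, b0 in zip(branch1, branch0)]


p = [Fraction(1, 20), Fraction(1, 4), Fraction(7, 10)]
p_hat = [Fraction(1, 10), Fraction(1, 4), Fraction(13, 20)]
same_before = fork_distribution(p, x=0, y=2) == p
same_after = fork_distribution(p_hat, x=0, y=2) == p_hat
```

Both the unrounded CRV `p` and its rounded counterpart `p_hat` are reproduced exactly, and the two differ only in the $X_i = 1$ branch.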

The proof of Lemma 1 follows from the following three lemmata. Intuitively, the first states
that the PMD induced by the CRVs for which Xi = 1 gives a Poisson Binomial distribution 
with mean concentrated around its expected value, for both the rounded and unrounded PMDs. 
The second states that if this value is concentrated, then the two distributions are close in total 
variation distance. The proof relates the rounded and unrounded distributions by comparing the 
total variation distance between the Poisson distributions with the same means. The third lemma 
eliminates the condition on the second lemma by using the first lemma, which states that this 
condition is likely to hold. 

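Lemma 14. If $\sum_{i \in \mathcal{I}_y^x} \rho(i, x) \ge 3ck \log\left(\frac{1}{ck}\right)$, then

$$
\Pr_{\theta}\left[
\begin{aligned}
&\left|\sum_{i \in \theta} k\rho(i, x) - E\Big[\sum_{i \in \mathcal{X}} k\rho(i, x)\Big]\right| \le \left(3ck \log\left(\tfrac{1}{ck}\right) E\Big[\sum_{i \in \mathcal{X}} k\rho(i, x)\Big]\right)^{1/2} \\
\wedge\ &\left|\sum_{i \in \theta} k\hat{\rho}(i, x) - E\Big[\sum_{i \in \hat{\mathcal{X}}} k\hat{\rho}(i, x)\Big]\right| \le \left(3ck \log\left(\tfrac{1}{ck}\right) E\Big[\sum_{i \in \hat{\mathcal{X}}} k\hat{\rho}(i, x)\Big]\right)^{1/2}
\end{aligned}
\right] \ge 1 - 4ck.
$$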


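Lemma 15. Suppose that, for some $\theta$, the following hold:

$$
\left|\sum_{i \in \theta} k\rho(i, x) - E\Big[\sum_{i \in \mathcal{X}} k\rho(i, x)\Big]\right| \le \left(3ck \log\left(\tfrac{1}{ck}\right) E\Big[\sum_{i \in \mathcal{X}} k\rho(i, x)\Big]\right)^{1/2}
$$

$$
\left|\sum_{i \in \theta} k\hat{\rho}(i, x) - E\Big[\sum_{i \in \hat{\mathcal{X}}} k\hat{\rho}(i, x)\Big]\right| \le \left(3ck \log\left(\tfrac{1}{ck}\right) E\Big[\sum_{i \in \hat{\mathcal{X}}} k\hat{\rho}(i, x)\Big]\right)^{1/2}
$$

Then, letting $Z_i$ be the Bernoulli random variable with expectation $k\rho(i, x)$ (and $\hat{Z}_i$ defined
similarly with $k\hat{\rho}(i, x)$),

$$
d_{TV}\left(\sum_{i \in \theta} Z_i,\ \sum_{i \in \theta} \hat{Z}_i\right) \le O\left(c^{1/2} k^{1/2} \log^{1/2}\left(\tfrac{1}{ck}\right)\right).
$$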


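Lemma 16. For any $\mathcal{I}_y^x$,

$$
d_{TV}\left(M^{\rho_{\mathcal{I}_y^x}},\ M^{\hat{\rho}_{\mathcal{I}_y^x}}\right) \le O\left(c^{1/2} k^{1/2} \log^{1/2}\left(\tfrac{1}{ck}\right)\right).
$$

Since our final rounded $(n, k)$-PMD is generated after applying this rounding procedure $O(k^2)$
times, Lemma 1 follows from our construction and Lemma 16 via the triangle inequality.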

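Proof of Lemma 14: Note that $\sum_{i \in \mathcal{X}} k\rho(i, x) = \sum_{i \in \mathcal{I}_y^x} \Omega_i$, where

$$
\Omega_i =
\begin{cases}
k\rho(i, x) & \text{with probability } \frac{1}{k} \\
0 & \text{with probability } 1 - \frac{1}{k}
\end{cases}
$$

We apply Lemma 11 to the rescaled random variables $\Omega_i' = \frac{\Omega_i}{ck}$ with
$\gamma = \left(\frac{3 \log\left(\frac{1}{ck}\right)}{E\left[\sum_{i \in \mathcal{I}_y^x} \Omega_i'\right]}\right)^{1/2}$, giving

$$
\Pr\left[\left|\sum_{i \in \mathcal{I}_y^x} \Omega_i' - E\Big[\sum_{i \in \mathcal{I}_y^x} \Omega_i'\Big]\right| \ge \left(3 \log\left(\tfrac{1}{ck}\right) E\Big[\sum_{i \in \mathcal{I}_y^x} \Omega_i'\Big]\right)^{1/2}\right] \le 2ck.
$$

Unscaling the variables gives

$$
\Pr\left[\left|\sum_{i \in \mathcal{X}} k\rho(i, x) - E\Big[\sum_{i \in \mathcal{X}} k\rho(i, x)\Big]\right| \ge \left(3ck \log\left(\tfrac{1}{ck}\right) E\Big[\sum_{i \in \mathcal{X}} k\rho(i, x)\Big]\right)^{1/2}\right] \le 2ck.
$$

Applying the same argument to $\hat{\rho}_{\mathcal{I}_y^x}$ gives

$$
\Pr\left[\left|\sum_{i \in \hat{\mathcal{X}}} k\hat{\rho}(i, x) - E\Big[\sum_{i \in \hat{\mathcal{X}}} k\hat{\rho}(i, x)\Big]\right| \ge \left(3ck \log\left(\tfrac{1}{ck}\right) E\Big[\sum_{i \in \hat{\mathcal{X}}} k\hat{\rho}(i, x)\Big]\right)^{1/2}\right] \le 2ck.
$$

Since $\mathcal{X} \sim \hat{\mathcal{X}}$, by considering the joint probability space where $\theta = \mathcal{X} = \hat{\mathcal{X}}$ and applying a
union bound, we get

$$
\Pr_{\theta}\left[
\begin{aligned}
&\left|\sum_{i \in \theta} k\rho(i, x) - E\Big[\sum_{i \in \mathcal{X}} k\rho(i, x)\Big]\right| \le \left(3ck \log\left(\tfrac{1}{ck}\right) E\Big[\sum_{i \in \mathcal{X}} k\rho(i, x)\Big]\right)^{1/2} \\
\wedge\ &\left|\sum_{i \in \theta} k\hat{\rho}(i, x) - E\Big[\sum_{i \in \hat{\mathcal{X}}} k\hat{\rho}(i, x)\Big]\right| \le \left(3ck \log\left(\tfrac{1}{ck}\right) E\Big[\sum_{i \in \hat{\mathcal{X}}} k\hat{\rho}(i, x)\Big]\right)^{1/2}
\end{aligned}
\right] \ge 1 - 4ck.
$$

□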


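Proof of Lemma 15: Fix some $\theta = \mathcal{X} = \hat{\mathcal{X}}$. Without loss of generality, assume
$E\left[\sum_{i \in \mathcal{X}} k\rho(i, x)\right] \ge E\left[\sum_{i \in \hat{\mathcal{X}}} k\hat{\rho}(i, x)\right]$.
There are two cases: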


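Case 1. $E\left[\sum_{i \in \mathcal{X}} k\rho(i, x)\right] \le (ck)^{3/4}$.

From the first assumption in the lemma statement,

$$
\sum_{i \in \theta} k\rho(i, x) \le E\Big[\sum_{i \in \mathcal{X}} k\rho(i, x)\Big] + \left(3ck \log\left(\tfrac{1}{ck}\right) E\Big[\sum_{i \in \mathcal{X}} k\rho(i, x)\Big]\right)^{1/2}
\le (ck)^{3/4} + \sqrt{3}\,(ck)^{7/8} \log^{1/2}\left(\tfrac{1}{ck}\right) := g(c, k).
$$

Similarly, by the second assumption in the lemma statement and since
$E\left[\sum_{i \in \mathcal{X}} k\rho(i, x)\right] \ge E\left[\sum_{i \in \hat{\mathcal{X}}} k\hat{\rho}(i, x)\right]$, we also have that $\sum_{i \in \theta} k\hat{\rho}(i, x) \le g(c, k)$.

By Markov’s inequality, $\Pr\left[\sum_{i \in \theta} Z_i \ge 1\right] \le \sum_{i \in \theta} k\rho(i, x) \le g(c, k)$, and similarly,
$\Pr\left[\sum_{i \in \theta} \hat{Z}_i \ge 1\right] \le g(c, k)$. This implies that

$$
\left|\Pr\Big[\sum_{i \in \theta} Z_i = 0\Big] - \Pr\Big[\sum_{i \in \theta} \hat{Z}_i = 0\Big]\right| \le 2g(c, k),
$$

and thus by the coupling lemma,

$$
d_{TV}\left(\sum_{i \in \theta} Z_i,\ \sum_{i \in \theta} \hat{Z}_i\right) \le 4g(c, k) = 4\left((ck)^{3/4} + \sqrt{3}\,(ck)^{7/8} \log^{1/2}\left(\tfrac{1}{ck}\right)\right).
$$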


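Case 2. $E\left[\sum_{i \in \mathcal{X}} k\rho(i, x)\right] > (ck)^{3/4}$.

We use the following proposition, which is a combination of a classical result in Poisson
approximation [BHJ92] and Lemma 3.10 in [DP07].

Proposition 9. For any set of independent Bernoulli random variables $\{Z_i\}_i$ with expectations
$E[Z_i] \le ck$,

$$
d_{TV}\left(\sum_i Z_i,\ \mathrm{Poisson}\Big(E\Big[\sum_i Z_i\Big]\Big)\right) \le ck.
$$

Applying this, we see

$$
d_{TV}\left(\sum_{i \in \theta} Z_i,\ \mathrm{Poisson}\Big(E\Big[\sum_{i \in \theta} Z_i\Big]\Big)\right) \le ck,
\qquad
d_{TV}\left(\sum_{i \in \theta} \hat{Z}_i,\ \mathrm{Poisson}\Big(E\Big[\sum_{i \in \theta} \hat{Z}_i\Big]\Big)\right) \le ck.
$$

We must now bound the distance between the two Poisson distributions. We use the following
lemma from [DP08]:

Lemma 17 (Lemma B.2 in [DP08]). If $\lambda = \lambda_0 + D$ for some $D > 0$, $\lambda_0 > 0$,

$$
d_{TV}\left(\mathrm{Poisson}(\lambda),\ \mathrm{Poisson}(\lambda_0)\right) \le D\sqrt{\frac{2}{\lambda_0}}.
$$

Applying this gives that

$$
d_{TV}\left(\mathrm{Poisson}\Big(E\Big[\sum_{i \in \theta} Z_i\Big]\Big),\ \mathrm{Poisson}\Big(E\Big[\sum_{i \in \theta} \hat{Z}_i\Big]\Big)\right)
\le \left|E\Big[\sum_{i \in \theta} Z_i\Big] - E\Big[\sum_{i \in \theta} \hat{Z}_i\Big]\right| \sqrt{\frac{2}{\min\left\{E\left[\sum_{i \in \theta} Z_i\right],\ E\left[\sum_{i \in \theta} \hat{Z}_i\right]\right\}}}.
$$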
To bound this, we need the following proposition, which we prove below: 
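Proposition 10.

$$
\left|\sum_{i \in \theta} k\rho(i, x) - \sum_{i \in \theta} k\hat{\rho}(i, x)\right| \sqrt{\frac{2}{\min\left\{\sum_{i \in \theta} k\rho(i, x),\ \sum_{i \in \theta} k\hat{\rho}(i, x)\right\}}}
\le \sqrt{30ck \log\left(\tfrac{1}{ck}\right)}.
$$

Thus, using the triangle inequality and this proposition, for sufficiently small $c$, we get

$$
d_{TV}\left(\sum_{i \in \theta} Z_i,\ \sum_{i \in \theta} \hat{Z}_i\right) \le 2ck + \sqrt{30ck \log\left(\tfrac{1}{ck}\right)} = O\left(c^{1/2} k^{1/2} \log^{1/2}\left(\tfrac{1}{ck}\right)\right).
$$

By comparing Cases 1 and 2, we see that the desired bound holds in both cases.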

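Proof of Proposition 10: By the definition of our rounding procedure, we observe that

$$
\left|E\Big[\sum_{i \in \mathcal{X}} k\rho(i, x)\Big] - E\Big[\sum_{i \in \hat{\mathcal{X}}} k\hat{\rho}(i, x)\Big]\right| \le c.
$$

By the assumptions of Lemma 15 and the assumption that
$E\left[\sum_{i \in \mathcal{X}} k\rho(i, x)\right] \ge E\left[\sum_{i \in \hat{\mathcal{X}}} k\hat{\rho}(i, x)\right]$,

$$
\begin{aligned}
\left|\sum_{i \in \theta} k\rho(i, x) - \sum_{i \in \theta} k\hat{\rho}(i, x)\right|
&\le \left|E\Big[\sum_{i \in \mathcal{X}} k\rho(i, x)\Big] - E\Big[\sum_{i \in \hat{\mathcal{X}}} k\hat{\rho}(i, x)\Big]\right| + \left(12ck \log\left(\tfrac{1}{ck}\right) E\Big[\sum_{i \in \mathcal{X}} k\rho(i, x)\Big]\right)^{1/2} \\
&\le c + \left(12ck \log\left(\tfrac{1}{ck}\right) E\Big[\sum_{i \in \mathcal{X}} k\rho(i, x)\Big]\right)^{1/2},
\end{aligned}
$$

and thus,

$$
\left(\sum_{i \in \theta} k\rho(i, x) - \sum_{i \in \theta} k\hat{\rho}(i, x)\right)^2 \le \left(c + \left(12ck \log\left(\tfrac{1}{ck}\right) E\Big[\sum_{i \in \mathcal{X}} k\rho(i, x)\Big]\right)^{1/2}\right)^2. \qquad (1)
$$

From the assumption that $E\left[\sum_{i \in \mathcal{X}} k\rho(i, x)\right] > (ck)^{3/4}$, for sufficiently small $c$,

$$
E\Big[\sum_{i \in \mathcal{X}} k\rho(i, x)\Big] = \left(E\Big[\sum_{i \in \mathcal{X}} k\rho(i, x)\Big]\right)^{1/2} \left(E\Big[\sum_{i \in \mathcal{X}} k\rho(i, x)\Big]\right)^{1/2}
\ge \left(12ck \log\left(\tfrac{1}{ck}\right) E\Big[\sum_{i \in \mathcal{X}} k\rho(i, x)\Big]\right)^{1/2}.
$$

Combining this with the first assumption of Lemma 15,

$$
\sum_{i \in \theta} k\rho(i, x) \ge E\Big[\sum_{i \in \mathcal{X}} k\rho(i, x)\Big] - \left(3ck \log\left(\tfrac{1}{ck}\right) E\Big[\sum_{i \in \mathcal{X}} k\rho(i, x)\Big]\right)^{1/2}
\ge \frac{1}{2} E\Big[\sum_{i \in \mathcal{X}} k\rho(i, x)\Big].
$$

Similarly, since $E\left[\sum_{i \in \hat{\mathcal{X}}} k\hat{\rho}(i, x)\right] \ge E\left[\sum_{i \in \mathcal{X}} k\rho(i, x)\right] - c > (ck)^{3/4} - c$, for $c$ sufficiently small,

$$
\sum_{i \in \theta} k\hat{\rho}(i, x) \ge \frac{1}{2} E\Big[\sum_{i \in \hat{\mathcal{X}}} k\hat{\rho}(i, x)\Big].
$$

It follows that

$$
\begin{aligned}
\min\left\{\sum_{i \in \theta} k\rho(i, x),\ \sum_{i \in \theta} k\hat{\rho}(i, x)\right\}
&\ge \frac{1}{2} \min\left\{E\Big[\sum_{i \in \mathcal{X}} k\rho(i, x)\Big],\ E\Big[\sum_{i \in \hat{\mathcal{X}}} k\hat{\rho}(i, x)\Big]\right\} \\
&\ge \frac{1}{2}\left(E\Big[\sum_{i \in \mathcal{X}} k\rho(i, x)\Big] - c\right)
\ge \frac{1}{4} E\Big[\sum_{i \in \mathcal{X}} k\rho(i, x)\Big], \qquad (2)
\end{aligned}
$$

where the last inequality follows for $c$ sufficiently small because $E\left[\sum_{i \in \mathcal{X}} k\rho(i, x)\right] > (ck)^{3/4}$.
From (1) and (2), for $c$ sufficiently small,

$$
\frac{2\left(\sum_{i \in \theta} k\rho(i, x) - \sum_{i \in \theta} k\hat{\rho}(i, x)\right)^2}{\min\left\{\sum_{i \in \theta} k\rho(i, x),\ \sum_{i \in \theta} k\hat{\rho}(i, x)\right\}} \le 30ck \log\left(\tfrac{1}{ck}\right),
$$

from which the proposition statement follows. □

□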

Proof of Lemma 16: First, note that if $\sum_{i \in \mathcal{I}_y^x} \rho(i, x) < 3ck \log\left(\frac{1}{ck}\right)$, the $\ell_1$ distance between the
parameters of the rounded and the unrounded distributions is at most $6ck \log\left(\frac{1}{ck}\right)$. By the triangle
inequality and the Data Processing Inequality (Lemma 13), this is an upper bound for the total
variation distance between the rounded and unrounded distributions, and the desired conclusion
holds. Therefore, for the remainder of the proof, assume that $\sum_{i \in \mathcal{I}_y^x} \rho(i, x) \ge 3ck \log\left(\frac{1}{ck}\right)$.
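Throughout this proof, we will couple the two sampling processes such that $\theta := \mathcal{X} = \hat{\mathcal{X}}$, which
is possible since $\mathcal{X} \sim \hat{\mathcal{X}}$. Let $\phi$ be the random event that $\theta$ satisfies the following conditions:

$$
\left|\sum_{i \in \theta} k\rho(i, x) - E\Big[\sum_{i \in \mathcal{X}} k\rho(i, x)\Big]\right| \le \left(3ck \log\left(\tfrac{1}{ck}\right) E\Big[\sum_{i \in \mathcal{X}} k\rho(i, x)\Big]\right)^{1/2}
$$

$$
\left|\sum_{i \in \theta} k\hat{\rho}(i, x) - E\Big[\sum_{i \in \hat{\mathcal{X}}} k\hat{\rho}(i, x)\Big]\right| \le \left(3ck \log\left(\tfrac{1}{ck}\right) E\Big[\sum_{i \in \hat{\mathcal{X}}} k\hat{\rho}(i, x)\Big]\right)^{1/2}
$$

Suppose that $\phi$ occurs, and fix a $\theta$ in this probability space. We start by showing that for such
a $\theta$,

$$
d_{TV}\left(M^{\rho_{\mathcal{I}_y^x}},\ M^{\hat{\rho}_{\mathcal{I}_y^x}} \,\middle|\, \mathcal{X} = \hat{\mathcal{X}} = \theta\right) \le O\left(c^{1/2} k^{1/2} \log^{1/2}\left(\tfrac{1}{ck}\right)\right).
$$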


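Let $M^{\rho_\theta}$ and $M^{\rho_{\bar\theta}}$ be the $(n, k)$-PMDs induced by the $k$-CRVs in $\rho_{\mathcal{I}_y^x}$ with indices in $\theta$ and not
in $\theta$, respectively. Define $M^{\hat{\rho}_\theta}$ and $M^{\hat{\rho}_{\bar\theta}}$ similarly. We can see

$$
\begin{aligned}
d_{TV}\left(M^{\rho_{\mathcal{I}_y^x}},\ M^{\hat{\rho}_{\mathcal{I}_y^x}} \,\middle|\, \mathcal{X} = \hat{\mathcal{X}} = \theta\right)
&= d_{TV}\left(M^{\rho_\theta} + M^{\rho_{\bar\theta}},\ M^{\hat{\rho}_\theta} + M^{\hat{\rho}_{\bar\theta}} \,\middle|\, \mathcal{X} = \hat{\mathcal{X}} = \theta\right) \\
&\le d_{TV}\left(M^{\rho_\theta},\ M^{\hat{\rho}_\theta} \,\middle|\, \mathcal{X} = \hat{\mathcal{X}} = \theta\right) + d_{TV}\left(M^{\rho_{\bar\theta}},\ M^{\hat{\rho}_{\bar\theta}} \,\middle|\, \mathcal{X} = \hat{\mathcal{X}} = \theta\right) \\
&\le d_{TV}\left(M^{\rho_\theta},\ M^{\hat{\rho}_\theta} \,\middle|\, \mathcal{X} = \hat{\mathcal{X}} = \theta\right) \\
&= d_{TV}\left(\sum_{i \in \theta} Z_i,\ \sum_{i \in \theta} \hat{Z}_i \,\middle|\, \mathcal{X} = \hat{\mathcal{X}} = \theta\right) \\
&\le O\left(c^{1/2} k^{1/2} \log^{1/2}\left(\tfrac{1}{ck}\right)\right).
\end{aligned}
$$

The first inequality is the triangle inequality, the second inequality is because the distributions for
$k$-CRVs not in $\theta$ are identical (since we do not change them in our rounding), and the third inequality
is Lemma 15.

By the law of total probability for total variation distance,

$$
\begin{aligned}
d_{TV}\left(M^{\rho_{\mathcal{I}_y^x}},\ M^{\hat{\rho}_{\mathcal{I}_y^x}}\right)
&= \Pr(\phi)\, d_{TV}\left(M^{\rho_{\mathcal{I}_y^x}},\ M^{\hat{\rho}_{\mathcal{I}_y^x}} \,\middle|\, \phi\right) + \Pr(\bar\phi)\, d_{TV}\left(M^{\rho_{\mathcal{I}_y^x}},\ M^{\hat{\rho}_{\mathcal{I}_y^x}} \,\middle|\, \bar\phi\right) \\
&\le (1 - 4ck) \cdot O\left(c^{1/2} k^{1/2} \log^{1/2}\left(\tfrac{1}{ck}\right)\right) + 4ck \cdot 1
= O\left(c^{1/2} k^{1/2} \log^{1/2}\left(\tfrac{1}{ck}\right)\right),
\end{aligned}
$$

where the inequality is obtained by applying Lemma 14 and the bound shown above pointwise for
$\theta$ which satisfy $\phi$.

□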
B.2 Converting to a Discretized Gaussian using the Valiant-Valiant CLT


We will now apply the CLT of Valiant and Valiant [VV10], Theorem 6, which we restate for
convenience.

Theorem 6 (Theorem 4 from [VV10]). Given a generalized multinomial distribution $G^{\rho}$, with $k$
dimensions and $n$ rows, let $\mu$ denote its mean and $\Sigma$ denote its covariance matrix, then

$$
d_{TV}\left(G^{\rho},\ \lfloor \mathcal{N}(\mu, \Sigma) \rceil\right) \le \frac{k^{4/3}}{\sigma^{1/3}} \cdot 2.2 \cdot (3.1 + 0.83 \log n)^{2/3},
$$

where $\sigma^2$ is the minimum eigenvalue of $\Sigma$.

As we can see from this inequality, there are two issues that may arise and lead to a bad
approximation:

• $G^{\rho}$ has small variance in some direction (cf. Proposition 6)

• $G^{\rho}$ has a large size parameter $n$

We must avoid both of these issues simultaneously: we will apply this result to several carefully
chosen sets, and then merge the resulting Gaussians into one using Lemma 2.
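
To make the two failure modes concrete, the following sketch evaluates the right-hand side of the CLT bound as a function of $k$, $n$, and $\sigma^2$. The function name is illustrative, and the use of the natural logarithm is an assumption of this example rather than something the restated theorem fixes.

```python
import math


def vv_clt_bound(k, n, sigma_sq):
    """Right-hand side k^(4/3) / sigma^(1/3) * 2.2 * (3.1 + 0.83 log n)^(2/3).

    sigma_sq is the minimum eigenvalue of the covariance matrix, so
    sigma^(1/3) equals sigma_sq^(1/6).
    """
    sigma_cuberoot = sigma_sq ** (1.0 / 6.0)
    log_term = (3.1 + 0.83 * math.log(n)) ** (2.0 / 3.0)
    return (k ** (4.0 / 3.0)) / sigma_cuberoot * 2.2 * log_term


# Small variance hurts: shrinking sigma^2 by 16x increases the bound.
small_variance = vv_clt_bound(k=3, n=1000, sigma_sq=4.0)
large_variance = vv_clt_bound(k=3, n=1000, sigma_sq=64.0)

# Large n hurts, but only through the (log n)^(2/3) factor.
small_n = vv_clt_bound(k=3, n=1000, sigma_sq=4.0)
large_n = vv_clt_bound(k=3, n=10**6, sigma_sq=4.0)
```

The bound is only informative when it falls below 1, which requires $\sigma^2$ to be large relative to $k^8 \log^4 n$; this is what the bucketing below is designed to arrange.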

The first step is to partition our CRVs into several sets, and then convert the PMDs induced by 
each set into GMDs (with an appropriately chosen pivot). The original PMD can be sampled by 
sampling each of these GMDs and then adding their results. In other words, the probability mass 
function of the PMD is the convolution of the probability mass functions of these GMDs. 

We start by partitioning the $k$-CRVs into $k$ sets $S_1, \dots, S_k$, where $S_{j'} = \{i \mid j' = \arg\max_j \pi(i, j)\}$
and ties are broken by lexicographic ordering. This defines $S_{j'}$ to be the set of indices of $k$-CRVs in
which $j'$ is the heaviest coordinate. Let $M^{\pi_{j'}}$ be the $(|S_{j'}|, k)$-PMD induced by taking the $k$-CRVs
in $S_{j'}$. For the remainder of this section, we will focus on $S_k$; the other cases follow symmetrically.

We convert each CRV in $S_k$ into a truncated $k$-CRV by omitting the $k$th coordinate, giving
us a $(|S_k|, k)$-GMD $G^{\rho_k}$. Since the $k$th coordinate was the heaviest, we can make the following
observation:

Observation 1. $\rho_k(i, 0) \ge \frac{1}{k}$ for all $i \in S_k$.

If we tried to apply Theorem 6 to $G^{\rho_k}$, we would obtain a vacuous result. For instance, if
there exists a $j$ such that $\rho_k(i, j) = 0$ for all $i$, the variance in this direction would be 0 and
Theorem 6 would give us a trivial result. Therefore, we further partition $S_k$ into $2^{k-1}$ sets indexed
by $\mathcal{I} \subseteq [k - 1]$, where each set contains the elements of $S_k$ which are non-zero on its indexing set and
zero otherwise. More formally,
$S_k^{\mathcal{I}} = \{i \mid (i \in S_k) \wedge (\rho_k(i, j) \ge c\ \forall j \in \mathcal{I}) \wedge (\rho_k(i, j) = 0\ \forall j \notin \mathcal{I})\}$.
For each of these sets, due to our rounding procedure, we know that the variance is non-negligible
in each of the non-zero directions. Naively, we would apply the CLT separately to each of these
sets. The issue is that merging the resulting $2^{k-1}$ Gaussians would be costly. Roughly, merging two
Gaussians into one incurs a cost proportional to the inverse of the minimum standard deviation of
either Gaussian. In order to avoid the cost of merging exponentially many Gaussians with similar
variances, before applying the CLT, we group sets $S_k^{\mathcal{I}}$ of similar variance together. The resulting
collection of Gaussians have variances which increase rapidly, and by merging them in the correct
order, we can minimize this cost.

Recall that $\gamma = O(1)$ and $t = \mathrm{poly}(k/\varepsilon)$ (as specified in Section 3). For an integer $l \ge 0$,
define $B^l = \bigcup_{\mathcal{I} \in Q_l} S_k^{\mathcal{I}}$, where $Q_l = \{\mathcal{I} \mid |S_k^{\mathcal{I}}| \in [l^{\gamma} t, (l + 1)^{\gamma} t)\}$. In other words, bucket $l$ will contain
a collection of truncated CRVs, defined by the union of the previously defined sets which have a
size falling in a particular interval.
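
The bucketing rule can be sketched as follows. The function names, the dictionary of set sizes, and the concrete values of `t` and `gamma` are hypothetical and chosen only to make the intervals $[l^{\gamma} t, (l+1)^{\gamma} t)$ easy to follow; integer arithmetic is used so the thresholds are exact.

```python
def bucket_index(size, t, gamma):
    """Return the l >= 0 with l**gamma * t <= size < (l + 1)**gamma * t."""
    l = 0
    while (l + 1) ** gamma * t <= size:
        l += 1
    return l


def group_into_buckets(set_sizes, t, gamma):
    """Map each bucket index l to the labels of the sets whose size falls in bucket l."""
    buckets = {}
    for label, size in set_sizes.items():
        buckets.setdefault(bucket_index(size, t, gamma), []).append(label)
    return buckets


# Hypothetical sizes |S_k^I| for five index sets; t = 10 and gamma = 7 are toy values.
set_sizes = {"I1": 5, "I2": 10, "I3": 1279, "I4": 1280, "I5": 30000}
buckets = group_into_buckets(set_sizes, t=10, gamma=7)
```

With these values the thresholds are 10, 1280, and 21870, so sets whose sizes differ by two orders of magnitude can still share a bucket, while bucket 0 collects every set with fewer than $t$ elements.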

At this point, we are ready to apply the central limit theorem: 
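Lemma 18. Let $G^{\rho_l}$ be the $(|B^l|, k)$-GMD induced by the truncated CRVs in $B^l$, and $\mu_l$ and $\Sigma_l$
be its mean and covariance matrix. Then

$$
d_{TV}\left(G^{\rho_l},\ \lfloor \mathcal{N}(\mu_l, \Sigma_l) \rceil\right) \le \frac{8.646\, k^{3/2} \log^{2/3}\left(2^k (l + 1)^{\gamma} t\right)}{l^{\gamma/6}\, t^{1/6}\, c^{1/6}}.
$$

Furthermore, the minimum non-zero eigenvalue of $\Sigma_l$ is at least $\frac{l^{\gamma} t c}{k}$.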

Proof. This follows from Theorem 6; it suffices to bound the values of “$n$” and “$\sigma^2$” which appear
in the theorem statement.

$B^l$ is the union of at most $2^k$ sets, each of size at most $(l + 1)^{\gamma} t$, which gives us the upper bound
of $2^k (l + 1)^{\gamma} t$ as the size of the induced GMD.

We must be more careful when reasoning about the minimum eigenvalue of $\Sigma_l$; indeed, it
may be 0 if there exists a $j'$ such that for all $i$, $\rho_l(i, j') = 0$. Therefore, we apply the CLT on
the GMD defined by removing all zero-columns from $\rho_l$, taking us down to a dimension $k' \le k$.
Afterwards, we lift the related discretized Gaussian up to $k$ dimensions by inserting 0 for the means
and covariances involving any of the $k - k'$ dimensions we removed. This operation will not increase
the total variation distance, by Lemma 13. From this point, we assume that all columns of $\rho_l$ are
non-zero.

Consider an arbitrary $S_k^{\mathcal{I}}$ which is included in $B^l$. Let $E_{\mathcal{I}} = \mathrm{span}\{e_i \mid i \in \mathcal{I}\}$. Applying
Proposition 6, Observation 1 and the properties necessary for inclusion in $S_k^{\mathcal{I}}$, we can see that a CRV
in $S_k^{\mathcal{I}}$ has variance at least $\frac{c}{k}$ within $E_{\mathcal{I}}$. Since inclusion in $B^l$ means that $|S_k^{\mathcal{I}}| \ge l^{\gamma} t$, and variance
is additive for independent random variables, the GMD induced by $S_k^{\mathcal{I}}$ has variance at least $\frac{l^{\gamma} t c}{k}$
within $E_{\mathcal{I}}$. To conclude, we note that if a column in $\rho_l$ is non-zero, there must be some $\mathcal{I}^* \in Q_l$
which intersects the corresponding dimension. Since $S_k^{\mathcal{I}^*}$ causes the variance in this direction to be
at least $\frac{l^{\gamma} t c}{k}$, we see that the variance in every direction must be this large. This also implies the
bound on the minimum non-zero eigenvalue of $\Sigma_l$.

By substituting these values into Theorem 6, we obtain the claimed bound. □


We note that this gives us a vacuous bound for $B^0$, which we must deal with separately. The
issue with this bucket is that the variance in some directions might be small compared to the size
of the GMD induced by the bucket. The intuition is that we can remove the truncated CRVs which
are non-zero in these low-variance dimensions, and the remaining truncated CRVs can be combined
into another GMD.


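Lemma 3. Let $G^{\rho_0}$ be the $(|B^0|, k)$-GMD induced by the truncated CRVs in bucket $B^0$. Given $\rho_0$,
we can efficiently compute a partition of $B^0$ into $S$ and $\bar{S}$, where $|\bar{S}| \le kt$. Letting $\mu_S$ and $\Sigma_S$ be
the mean and covariance matrix of the $(|S|, k)$-GMD induced by $S$, and $G^{\rho_{\bar{S}}}$ be the $(|\bar{S}|, k)$-GMD
induced by $\bar{S}$,

$$
d_{TV}\left(G^{\rho_0},\ \lfloor \mathcal{N}(\mu_S, \Sigma_S) \rceil * G^{\rho_{\bar{S}}}\right) \le \frac{8.646\, k^{3/2} \log^{2/3}\left(2^k t\right)}{t^{1/6}\, c^{1/6}}.
$$

Furthermore, the minimum non-zero eigenvalue of $\Sigma_S$ is at least $\frac{tc}{k}$.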


Proof. The algorithm iteratively eliminates columns which have fewer than $t$ non-zero entries. For
each such column $j$, add all truncated CRVs which have non-zero entries in column $j$ to $\bar{S}$. Since
there are only $k$ columns, we add at most $kt$ truncated CRVs to $\bar{S}$.

Now, we apply Theorem 6 to the truncated CRVs in $S$. The analysis of this is similar to the
proof of Lemma 18. As argued before, we can drop the dimensions which have 0 variance. This
time, the size of the GMD is at most $2^k t$, which follows from the definition of $B^0$. Recall that the
minimum variance of a single truncated CRV in $S$ is at least $\frac{c}{k}$ in any direction in the span of its
non-zero columns. After removing the CRVs in $\bar{S}$, every dimension with non-zero variance must
have at least $t$ truncated CRVs which are non-zero in that dimension, giving a variance of at least
$\frac{tc}{k}$. Substituting these parameters into Theorem 6 gives the claimed bound. □

We assemble the two lemmata to obtain the following result: 

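Lemma 19. Let $G^{\rho_k}$ be an $(n, k)$-GMD with $\rho_k(i, j) \notin (0, c)$ and $\sum_j \rho_k(i, j) \le 1 - \frac{1}{k}$ for all $i$, and
let $S_k$ be its set of component truncated CRVs. There exists an efficiently computable partition of
$S_k$ into $S$ and $\bar{S}$, where $|\bar{S}| \le kt$. Furthermore, letting $\mu_S$ and $\Sigma_S$ be the mean and covariance
matrix of the $(|S|, k)$-GMD induced by $S$, and $G^{\rho_{\bar{S}}}$ be the $(|\bar{S}|, k)$-GMD induced by $\bar{S}$,

$$
d_{TV}\left(G^{\rho_k},\ \lfloor \mathcal{N}(\mu_S, \Sigma_S) \rceil * G^{\rho_{\bar{S}}}\right) \le O\left(\frac{k^{13/6} \log^{2/3} t}{c^{1/6}\, t^{1/6}} + \frac{k^{3/2}}{c^{1/2}\, t^{1/2}}\right).
$$

Furthermore, the minimum non-zero eigenvalue of $\Sigma_S$ is at least $\frac{tc}{k}$.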


Proof. This is a combination of Lemmas 18 and 3, with the results merged using Lemma 2.

As described above, we will group the truncated CRVs into several buckets. We first apply
Lemma 18 to each of the non-empty buckets for $l > 0$. This will give us a sum of many discretized
Gaussians. If applicable, we apply Lemma 3 to $B^0$ to obtain another discretized Gaussian and a
set $\bar{S}$ of at most $kt$ truncated CRVs. By applying Lemma 2, we can “merge” the sum of many discretized
Gaussians into a single discretized Gaussian. By the triangle inequality, the error incurred in the
lemma statement is the sum of all of these approximations.

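We start by analyzing the cost of applying Lemma 18. Recall $\gamma = 6 + \delta_\gamma$ for some constant
$\delta_\gamma > 0$. Let the set of $N$ non-empty buckets be $\mathcal{X}$. Then the sum of the errors incurred by all $N$
applications of Lemma 18 is at most

$$
\begin{aligned}
\sum_{l \in \mathcal{X}} O\left(\frac{k^{3/2} \log^{2/3}\left(2^k (l + 1)^{6 + \delta_\gamma} t\right)}{l^{(6 + \delta_\gamma)/6}\, t^{1/6}\, c^{1/6}}\right)
&\le \sum_{l = 1}^{\infty} O\left(\frac{k^{3/2} \log^{2/3}\left(2^k (l + 1)^{6 + \delta_\gamma} t\right)}{l^{(6 + \delta_\gamma)/6}\, t^{1/6}\, c^{1/6}}\right) \\
&\le \frac{k^{13/6} \log^{2/3} t}{c^{1/6}\, t^{1/6}} \sum_{l = 1}^{\infty} O\left(\frac{\log^{2/3}(l + 1)}{l^{(6 + \delta_\gamma)/6}}\right) \\
&\le \frac{k^{13/6} \log^{2/3} t}{c^{1/6}\, t^{1/6}} \sum_{l = 1}^{\infty} O\left(\frac{1}{l^{(6 + \delta')/6}}\right) \\
&\le O\left(\frac{k^{13/6} \log^{2/3} t}{c^{1/6}\, t^{1/6}}\right)
\end{aligned}
$$

for any constant $0 < \delta' < \delta_\gamma$. The final inequality is because the series $\sum_n n^{-c}$ converges for any
$c > 1$.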

The cost of applying Lemma 3 is analyzed similarly,

$$
\frac{8.646\, k^{3/2} \log^{2/3}\left(2^k t\right)}{t^{1/6}\, c^{1/6}} \le O\left(\frac{k^{13/6} \log^{2/3} t}{c^{1/6}\, t^{1/6}}\right).
$$


Finally, we analyze the cost of merging the $N + 1$ Gaussians into one. We will analyze this
by considering the following process: we maintain a discretized Gaussian, which we will name the
candidate. The candidate is initialized to be the Gaussian generated from the highest numbered
non-empty bucket. At every time step, we update the candidate to be the result of merging itself
with the Gaussian from the highest numbered non-empty bucket which has not yet been merged.
We continue until the Gaussian from every non-empty bucket has been merged with the candidate.

By Lemma 2, the cost of merging two Gaussians is at most $O\left(\frac{k}{\sigma}\right)$, where $\sigma^2$ is the minimum
variance of either Gaussian in any direction where either has a non-zero variance. From Lemma 18,
the variance of the Gaussian from $B^l$ is at least $\frac{l^{\gamma} t c}{k}$ in every direction of non-zero variance. Since
we are considering the buckets in decreasing order and merging two Gaussians only increases the
variance, when merging the candidate with bucket $l$, the maximum cost we can incur is
$O\left(\frac{k^{3/2}}{l^{\gamma/2}\, c^{1/2}\, t^{1/2}}\right)$. Summing over all buckets in $\mathcal{X}$,

$$
\sum_{l \in \mathcal{X}} O\left(\frac{k^{3/2}}{l^{\gamma/2}\, c^{1/2}\, t^{1/2}}\right)
\le \frac{k^{3/2}}{c^{1/2}\, t^{1/2}} \sum_{l = 1}^{\infty} O\left(\frac{1}{l^{(6 + \delta_\gamma)/2}}\right)
\le O\left(\frac{k^{3/2}}{c^{1/2}\, t^{1/2}}\right),
$$

where the second inequality is because the series $\sum_n n^{-c}$ converges for any $c > 1$. We note
that, from Lemma 3, the variance of the Gaussian obtained from $B^0$ is at least $\frac{tc}{k}$ in any non-zero
direction. Therefore, merging this Gaussian with the rest does not affect our bound asymptotically.
Since the minimum non-zero variance of any Gaussian we merged was at least $\frac{tc}{k}$, the same holds
for the resulting merged Gaussian and its minimum non-zero eigenvalue.

By adding the error terms obtained from each of the approximations, we obtain the claimed
bound on total variation distance. □


B.3 Merging k Gaussians into one 

In order to merge the k discretized Gaussians into one, we perform a series of “swap-and-merge” 
operations, in which we swap the pivots of two discretized Gaussians to be the same, and then 
merge the resulting distributions into one. We repeat this process until all Gaussians which overlap 
in some dimension are merged together. The following lemma bounds the cost of swapping a pivot. 

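Lemma 4 (Total Variation Swap Lemma). For $\mu \in \mathbb{R}^k$, positive semidefinite $\Sigma \in \mathbb{R}^{k \times k}$, and $n \in \mathbb{Z}$, let

• $X_i$ be the distribution $\mathcal{N}(\mu_{-i}, \Sigma_{-i})$, where $\mu_{-i} \in \mathbb{R}^{k-1}$ is $\mu$ with the $i$th coordinate removed,
and $\Sigma_{-i}$ is $\Sigma$ with the $i$th row and column removed;

• $Y_i$ be the distribution in which we draw a sample $(x_1, \dots, x_{k-1}) \sim X_i$ and return
$\left(\lfloor x_1 \rceil, \dots, \lfloor x_{i-1} \rceil,\ n - \sum_{j=1}^{k-1} \lfloor x_j \rceil,\ \lfloor x_i \rceil, \dots, \lfloor x_{k-1} \rceil\right)$.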

Then dT\/{Yi,Yj) < ^ for any i,j £ [k], where cr^ = max(cr^^,cr^^) and cr^j is the smallest 
eigenvalue of'T-i. 

Proof. Without loss of generality, assume {i,j) = (1,2) and a ‘^2 ^ Sampling from Yi can be 
described by the following process; Draw a sample x^] 2 ~ ^- 1 , 2)5 which is the Gaussian 

obtained from M{pi, S) by projecting on to all dimensions except 1 and 2. Now, condition on x^] 21 
sample the 2 nd coordinate as the one dimensional projection onto 62 of S) conditioned 
on 21 discretize all these values (i.e., round them to the nearest integer), and then set the 1 st 
coordinate to be = n — ~ • 

(‘Y] 

Similarly, to draw a sample from 12 , we first sample x_[ 2 ~ A^(/r_i^ 2 , 51-i,2)- We then condition 
on 2 , sample the 1 st coordinate x[^^ as the one dimensional projection onto ei of AA(/z,S) 
conditioned on x_{ 2 , discretize all these values, and then set the 2 nd coordinate to be [x^ = 

E£>3L®ri - 

We couple the two sampling processes by letting x^| 2 = ^-1 2 X-i^ 2 - With this in mind, we 

note that x[^^ + x^^^ = x[^^ + x^^^ = n — where x^^^ and x^^^ are the “unrounded” 

versions of these coordinates, x^^^ is distributed independently and identically to X|^\ and similarly 
for X 2 ^^ and x^^^. We also define n' to be n — • Ignoring the dimensions besides 1 and 2 

(since they are coupled to be identical), the total variation distance between Yi and I 2 is equal to 
the distance between (n' — [x^^^], [x^^^]) and ([xp^],n' — [x[^^]). By Lemma [T^ this is at most 
the total variation distance between n' — and [x^^^]. Therefore, it suffices to upper bound 

the total variation distance between n' — and [n — Since n' is fixed, this is equal to 

the total variation distance between and [x® + z], where 2 is some constant between 0 and 

k — 1. Again using Lemma [T3l this is upper bounded by the distance between x^^^ and x® + z. 
Proposition [ 2 ] bounds this by ^ as desired. □ 

As shown in Lemma [21 merging two Gaussians is cheap, assuming the minimum eigenvalues of 
their covariance matrices are sufficiently large. The following lemma shows that this value stays 
large throughout the sequence of swap-and-merge operations. 

Lemma 5 (Variance Swap Lemma). Let ..., G he a sequence of symmetric positive- 
semidefinite matrices, and define = {j \ 7^ 0} to be the set of coordinates in which 

E« 

is non-zero. Furthermore, let E = ^ Suppose the following hold for all 

i: 


1. E^*^ has eigenvalue 0 with corresponding eigenvector 1 

2. There exists coordinate j* G 5^®^ such that has minimum eigenvalue at least X 

3. (U£<iSW) n SW ^ 0 


Then, for all j G S, the minimum eigenvalue is at least 



Proof. We need to prove that for all vectors y G such that yj = 0 and ||y ||2 = 1, y'^Ty > 

We have that y^Ey = Y^-y'^T,^’‘'>y > max^y^E^®) y since all matrices are positive semidefinite. 
We now consider a coordinate f of y with maximum absolute value which has weight at least 
Since the covariance matrix E is the result of summing matrices with common coordinates (by 
property 3 in the lemma statement), there is a sequence of coordinates starting from f and ending 
with j that has length at most k, such that any two consecutive coordinates belong to at least one 
of the sets 5^®). Since \yj'\ > while yj = 0, it means that there exists a pair (a, b) of consecutive 

coordinates in the path such that |ya — 
Consider such that a,b G 5^*^ Let j* € be the coordinate such that has 

minimum eigenvalue at least A. We have that: 

= {yg(i) — — yj*lgii)) > A||y5'(i) — yj*ls(^'>\\2 

where the second equality follows by property 1 in the lemma statement and the last inequality 
follows since has minimum eigenvalue at least A. Moreover since \ya — yb\ > we have 

that ||y5(i) “yi*lsW Hi — {ya — yj*)‘^ + {yb — yj*)‘^ ^ which completes the proof of the lemma. □ 


Finally, with these two lemmas in hand, we can conclude with the proof of Theorem 5.

Proof of Theorem 5: First, we justify the structure of the approximation, and then show that it can
be $\varepsilon$-close with our choice of the parameters $c$ and $t$. We start by applying Lemma 1 to obtain a
PMD such that $\hat{\pi}(i, j) \notin (0, c)$ for all $i, j$. Partition the component CRVs into $k$ sets $S_1, \dots, S_k$,
where the $i$th CRV is placed in the $l$th set if $l = \arg\max_j \hat{\pi}(i, j)$ (with ties broken lexicographically).
Since index $l$ is the heaviest, every CRV $i$ in $S_l$ has $\hat{\pi}(i, l) \ge \frac{1}{k}$. We convert the PMD induced by
each $S_l$ to a GMD by dropping the $l$th column. Applying Lemma 19 to each set and summing
the results from all sets gives us a sum of $k$ Gaussians with a structure preserving rounding and
a $(tk^2, k)$-Poisson multinomial random vector. Now, we iteratively merge the $k$ Gaussians: while
there exists a pair of Gaussians which overlap in some dimension $i$ (i.e., there exists a dimension
$i$ such that both Gaussians are not deterministically 0), we merge them. To do this, we adjust
the structure preserving rounding of both of the Gaussians to have pivot position $i$ (justified by
Lemma 4), and then combine them by replacing their sum with a single Gaussian with a structure
preserving rounding (using Lemmas 5 and 2). This new Gaussian will have the same pivot $i$, and
a mean vector and a covariance matrix equal to the sum of the two components. We repeat until
we are left with a set of Gaussians which do not overlap, and then combine them into a single
Gaussian with a structure preserving rounding, where each of the (disjoint) Gaussians corresponds
to a different block. We note that Lemmas m and [5] justify the minimum eigenvalue of each block 
of the covariance. 

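Now, we show that our choices of $c$ and $t$ make the resulting distribution $\varepsilon$-close to the
original. Applying Lemma 1 introduces a cost of $O\left(c^{1/2} k^{5/2} \log^{1/2}\left(\frac{1}{ck}\right)\right)$ in our approximation. We
apply Lemma 19 $k$ times (once to each set $S_l$), so the total cost introduced here is
$O\left(\frac{k^{19/6} \log^{2/3} t}{c^{1/6}\, t^{1/6}} + \frac{k^{5/2}}{c^{1/2}\, t^{1/2}}\right)$.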

Lemma m shows that each pivot swap costs ^ in total variation distance. Lemma [5] combined with 
[19] imply that and there are at most 2k pivot swaps, so this sequence of swaps costs at 

most Similarly, by Lemma[2l each our (at most) k merges costs ^ < ■^=. Therefore, the total 
variation distance introduced in this entire sequence of operations is 


O cV2A:5/2iogi/2 


^19/6 log2/3 ^ ^5/2 ^4 

+ cV6il/6 + cV2il/2 + 


Recalling our choice of parameters, c 
variation distance which is 0(e). 




for 6 c, 6t > 0, this results in a total 


□ 


C Details from Section [4] 

C.l A Direct Cover 

In this section, we present a direct cover of the class, following from the structural result of
Theorem 5. At a high level, we grid over the $O(k^2)$ parameters of the Gaussian component with
granularity $\mathrm{poly}(\varepsilon/k)/n$, and the $\mathrm{poly}(k/\varepsilon)$ parameters of the $(tk^2, k)$-PMD with granularity
$\mathrm{poly}(\varepsilon/k)$, resulting in a cover of the claimed size.

Proof of Lemma\^ Our strategy will be as follows: Theorem [5] implies that the original distribution 
is $O(\varepsilon)$-close to a particular class of distributions. We generate an $O(\varepsilon)$-cover for this generated
class. By the triangle inequality, this is an $O(\varepsilon)$-cover for $(n, k)$-PMDs. In order to generate a cover,
we will use a technique known as “gridding”. We will generate a set of values for each parameter,
and take the Cartesian product of these sets. Our guarantee is that the resulting set will contain
at least one set of parameters defining a distribution which is $O(\varepsilon)$-close to the PMD.
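
The gridding technique itself is simple to sketch: build a list of candidate values per parameter and take the Cartesian product. The function names, the range $[0, 1]$, and the step size below are illustrative assumptions; the actual granularities are the ones derived in the remainder of this proof.

```python
import itertools


def grid_values(lo, hi, step):
    """Evenly spaced candidate values lo, lo + step, ..., hi for one parameter."""
    count = int(round((hi - lo) / step)) + 1
    return [lo + i * step for i in range(count)]


def grid_cover(num_params, lo, hi, step):
    """Cartesian product of the per-parameter candidate values."""
    axis = grid_values(lo, hi, step)
    return itertools.product(axis, repeat=num_params)


def nearest_grid_point(point, lo, step):
    """Snap each coordinate of point to its closest candidate value."""
    return tuple(lo + round((v - lo) / step) * step for v in point)


step = 0.25
cover = set(grid_cover(num_params=2, lo=0.0, hi=1.0, step=step))
target = (0.3, 0.9)
snapped = nearest_grid_point(target, lo=0.0, step=step)
```

Every target has a candidate within `step / 2` in each coordinate, and the cover has `(1 / step + 1) ** num_params` elements, which is the source of the exponential dependence on the number of parameters in the cover size.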

First, observe that we can naively grid over the set of A;)-PMDs. We note that if two 
CRVs have parameters which are within ±| of each other, then their total variation distance is at 
most e. Similarly, by triangle inequality, two PMDs of size k'^t and dimension k with parameters 
within of each other have a total variation distance at most e. By taking an additive grid of 
granularity over all k‘^t parameters, we can generate an 0(e)-cover for PMDs of size k'^t and 


dimension k with O 




kH 


candidates. 


Next, we wish to cover the Gaussian component. For a block, we will use $\mu_i$ and $\Sigma_i$ to refer to
the mean and covariance, $n_i$ to the sum of the means within the block, and $S_i$ to refer to the set of
coordinates. It will actually be more convenient to think of $\Sigma_i$ in terms of a Cholesky decomposition
$\Sigma_i = L_i L_i^T$, which is guaranteed to exist since $\Sigma_i$ is symmetric and positive semidefinite. We describe
how to generate an $O\left(\frac{\varepsilon}{k}\right)$-cover for a single block. We will prove that the underlying (continuous)
Gaussians are $O\left(\frac{\varepsilon}{k}\right)$-close; the closeness of the corresponding discretized versions follows by Lemma
13. By taking the Cartesian product of the cover for each of the blocks and applying the triangle
inequality, we generate an $O(\varepsilon)$-cover for the overall Gaussian at the cost of a factor of $k$ in the
exponent of the cover size.

First, we examine the size parameter $n_i$. Since the size parameter is an integer between 0 and
$n$, we can simply try them all, giving us a factor of $n$ in the size of our cover.

Covering the mean and covariance matrix takes a bit more care. We use Proposition [3] to 
analyze the error incurred by inaccurate guesses for these parameters. We let Mi be the Gaussian 
corresponding to a single block of our Gaussian, and we will construct a M 2 which is close to it. 
By Theorem [5l we know that M > ^ 3 . 

We examine the first term of the bound in Proposition [3l If \\fii — // 2 II 2 < ■= /3, then this 

term is O (f). Note that G [0, since each coordinate is the sum of at most n parameters 

which are at most 1. We can create a /3-cover of size O ( ) for this space, with respect to 

the £2 distance. To see this, consider covering the space with (|S'i| — l)-cubes of side length . ^ 

Y I Si I — 1 

Any two points within the same cube are at £2 distance at most /3. Taking a single vertex from 

the (|5j| — l)-cube, our cover is defined by taking the corresponding vertex from all the cubes. 

/ \|Si|-l 

The volume of is and the volume of each cube is ( ,, f. 1 , so the total 


number of points needed is 
|Si|-l 


VlSil-i 

/3 


|Si|-l 


Substituting in the value of j3 shows that a set of 


size O ( 

ecy/t 


( l.3/2\|Si|—1 

= O J suffices to cover the mean of the Gaussian to a sufficient 

accuracy. 
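The cube-based covering of the mean can be illustrated with a small sketch. The function names `grid_cover` and `nearest_distance` are hypothetical helpers introduced only for this illustration, and the parameters `n`, `d` (standing in for $|S_i| - 1$) and `beta` are chosen by the caller; this is not the construction's actual implementation, only a direct instance of the cube argument.

```python
import itertools
import math


def grid_cover(n, d, beta):
    """Return one vertex per cube of side beta / sqrt(d) tiling [0, n]^d.

    Every point of [0, n]^d lies in some cube, and the cube's diameter is
    side * sqrt(d) = beta, so the returned vertices form a beta-cover in l2.
    """
    side = beta / math.sqrt(d)
    steps = math.ceil(n / side)              # cubes needed along each axis
    axis = [i * side for i in range(steps)]  # lower corner of each cube
    return list(itertools.product(axis, repeat=d))


def nearest_distance(point, cover):
    """l2 distance from a point to the closest cover element."""
    return min(math.dist(point, c) for c in cover)
```

The cover has `steps ** d` points, which is at most $(n\sqrt{d}/\beta + 1)^d$, in line with the count in the text.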

Next, we examine the second term in Proposition [3j Taking a < ||| sets this term to be 
$O(\epsilon/k)$. However, we cannot naively grid over the matrices, since the covariance matrix is required
to be PSD. Therefore, we instead grid over entries of a Cholesky decomposition. Observe that
the diagonal entries of the true covariance matrix are equal to the squared $\ell_2$ norms of the rows of the
true Cholesky decomposition. Since the maximum entry in the true covariance matrix is at most
$n$, this implies that the magnitude of the maximum entry of the true Cholesky decomposition is
at most $\sqrt{n}$. If we grid over the entries of the Cholesky decomposition with granularity $\gamma$, there
will exist a candidate where all entries are within $\pm\gamma$ of the true entries. Using the bound of $\sqrt{n}$
on the maximum element and $|S_i| - 1$ as the dimension of the matrix, this will imply that the
entries of the resulting covariance matrix are within $\pm(2\gamma\sqrt{n} + \gamma^2)(|S_i| - 1) \le 4\gamma|S_i|\sqrt{n}$ of the
true entries. Since we want this value to be upper bounded by $\alpha$, it gives that $\gamma \le \frac{\alpha}{4|S_i|\sqrt{n}}$.

$^1$Recall that the Cholesky decomposition implies that $L_i$ will be lower triangular.
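The effect of gridding the Cholesky factor can be checked numerically. The sketch below assumes NumPy and a strictly positive definite input (which `numpy.linalg.cholesky` requires); `round_cholesky` is a hypothetical helper name, and rounding to the nearest multiple of `gamma` is one simple way to pick a grid point within $\pm\gamma$ of each true entry.

```python
import numpy as np


def round_cholesky(sigma, gamma):
    """Round the Cholesky factor of sigma entrywise to multiples of gamma.

    The result L_grid @ L_grid.T is PSD by construction, which is the reason
    for gridding the factor instead of the covariance matrix itself.
    """
    L = np.linalg.cholesky(sigma)          # lower triangular, sigma = L @ L.T
    L_grid = np.round(L / gamma) * gamma   # each entry moves by at most gamma / 2
    return L_grid @ L_grid.T
```

For a $d \times d$ matrix whose largest diagonal entry is $n$, every entry of the rounded product differs from the original by at most $(2\gamma\sqrt{n} + \gamma^2)d$, matching the entrywise bound derived in the text.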

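Combining this gridding granularity with the fact that there are at most $\frac{|S_i|(|S_i|-1)}{2}$ non-zero entries
in the Cholesky decomposition, which are in the range $[-\sqrt{n}, \sqrt{n}]$, this implies a cover of size at
most $O\left(\left(\frac{\sqrt{n}}{\gamma}\right)^{|S_i|(|S_i|-1)/2}\right)$.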


Combining the gridding for the size, mean, and covariance, a O (|)-cover for one block is of 


size 





O 


ect 


h\Si?-h\Si\ , , , / u \0{\Si\^) 

= nh\S^\^+2\Si\ I A] 

ect J 


Taking the Cartesian product over all the blocks of the Gaussian and noting this function is convex
in the values of $\{|S_i|\}$, we cover the entire Gaussian with a set of size at most


712 


-) 

ect) 


0 {k?) 


Combining the cover for the Gaussian component and the $(tk^2, k)$-PMD gives us a cover of size

k?t 


ect) V e / 


77 , 2 '' + 2 '' 


Substituting in the values of c and t gives us a cover of size 




^26+7 


for constants $\delta_1, \delta_2 > 0$, which satisfies the statement of the theorem.


□ 


C.2 A Sparser Cover

In the previous section, we chose a naive gridding for the sparse component, resulting in a cover
size which is exponential in $\mathrm{poly}(k/\epsilon)$. In this section, we present the cover described by Lemma 7,
which is of size exponential in $\log^{k+2}(1/\epsilon)$. We use a moment matching technique similar to
Roos [Roo02]. In this work, Roos showed that any generalized multinomial distribution can be
written as a weighted summation of derivatives of a simple multinomial distribution. To describe
his theorem, we define $V_k(n) = \{u \in \mathbb{Z}^k : u_j \ge 0 \wedge \sum_j u_j \le n\}$.

Lemma 20 (Theorem 1 in [Roo02]). For an arbitrary vector $q$ with $\sum_j q_j < 1$, the density of the
generalized multinomial distribution $M^\rho$ at any point $x$ can be expressed as:

$$M^\rho(x) = \sum_{u \in V_k(n)} a_u(q)\, \Delta^u M(n - |u|, q, x)$$

where $M(n, q, x)$ represents the density of the multinomial distribution with probabilities $q$ at point
$x$ and $a_u(q)$ is the coefficient of the term $\prod_{j=1}^k z_j^{u_j}$ in the expansion of the polynomial:

$$\prod_{i=1}^n \left(1 + \sum_{j=1}^k (\rho(i,j) - q_j) z_j\right).$$

Roos also showed that considering fewer terms in the summation above provides a good approximation
to the density of the original generalized multinomial distribution. We consider the
approximator:

$$m_{w,q}(x) = \sum_{u \in V_k(w)} a_u(q)\, \Delta^u M(n - |u|, q, x)$$

Lemma 21 (Theorem 2 in [Roo02]).

$$\|M^\rho - m_{w,q}\|_1 \le \frac{\alpha^{w+1}}{1 - \alpha} \quad \text{for } \alpha < 1$$

where 


a = ~ + (Er=i(p(bi) - Qj)y 

^ V 


We will use these results to produce a sparser cover. We will first show that for a particular 
class of generalized multinomial distributions, there exist good approximators. 

Lemma 22. Consider a generalized multinomial distribution M^. If for all j £ [k], it holds that 
{maxi p{i, j) — vaiTii p{i,j)\ < {Aek‘^)~^ and moreover P(b 0) > then: 

$$\|M^\rho - m_{w,q}\|_1 \le 2^{-w}$$

for the vector $q$ with $q_j = \frac{1}{n}\sum_{i=1}^n \rho(i,j)$.


Proof. Since according to Lemma 21 the $\ell_1$ distance of $M^\rho$ to the approximator is at most
$\frac{\alpha^{w+1}}{1-\alpha}$, it suffices to show that $\alpha \le \frac{1}{2}$.

By our choice of $q$ it holds that $\sum_{i=1}^n (\rho(i,j) - q_j) = 0$. Therefore, we have that:

^ V f^i V 

since go ^ p Moreover, we have that ^ I ma-Xj/9(z, j) — m.m.i p{i, j)\ < (4eA:^)”^. 

Plugging this bound in the above expression for a gives the desired bound. □ 


We now show that if two PMDs have matching moments then their approximators are the same. 
This will allow us to compare the total variation between them by looking at their distance to the 
common approximator. 

Lemma 23. Consider two generalized multinomial distributions $M^\rho$, $M^{\rho'}$ and their approximators
$m_{w,q}$ and $m'_{w,q}$. If for all $u \in V_k(w)$:

$$\sum_{i=1}^n \prod_{j=1}^k \rho(i,j)^{u_j} = \sum_{i=1}^n \prod_{j=1}^k \rho'(i,j)^{u_j}$$

then $m_{w,q} = m'_{w,q}$.

Proof. We first note that if the condition holds for all $u \in V_k(w)$, then it also holds that for all
$u \in V_k(w)$:

$$\sum_{i=1}^n \prod_{j=1}^k (\rho(i,j) - q_j)^{u_j} = \sum_{i=1}^n \prod_{j=1}^k (\rho'(i,j) - q_j)^{u_j}$$

This is because when expanding the product $\prod_{j=1}^k (\rho(i,j) - q_j)^{u_j}$, treating it as a polynomial
in $q_j$, the coefficients in each term are a polynomial of degree at most $w$ in the $\rho(i,j)$, and summing
over all $i$ we get that the two sides are equal.

We now define $\tilde\rho(i,j) = \rho(i,j) - q_j$ and note that according to Lemma 20, the coefficients of
the approximator $m_{w,q}$ are given by the expansion of the polynomial $\prod_{i=1}^n \left(1 + \sum_{j=1}^k \tilde\rho(i,j) z_j\right)$.

We observe that for any given $u$, the coefficient of the term $\prod_{j=1}^k z_j^{u_j}$ is a degree $|u|$ polynomial
in terms of $\tilde\rho(i,j)$ which, by the theory of multisymmetric polynomials, can be written entirely as
a polynomial of the elementary multisymmetric polynomials, $\sum_{i=1}^n \prod_{j=1}^k \tilde\rho(i,j)^{u'_j}$ for $u' \in V_k(w)$.
Since $M^\rho$ and $M^{\rho'}$ are equal in all those terms, it means that they have equal coefficients $a_u(q)$
and thus their approximators are the same. □
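To make the objects in this argument concrete, the sketch below computes the power sums $\sum_i \prod_j \rho(i,j)^{u_j}$ for all $1 \le |u| \le w$ and the truncated expansion coefficients of $\prod_i (1 + \sum_j (\rho(i,j) - q_j) z_j)$. All function names are hypothetical, and the data layout (`rho` as a list of rows, one per CRV) is an assumption of this illustration. The usage check only exercises the simplest case of equal profiles, a reordering of the same CRVs.

```python
import itertools
from collections import defaultdict


def exponent_vectors(k, w):
    """All u in Z_{>=0}^k with 1 <= |u| <= w."""
    return [u for u in itertools.product(range(w + 1), repeat=k)
            if 1 <= sum(u) <= w]


def moment_profile(rho, w):
    """Map each exponent vector u to sum_i prod_j rho[i][j] ** u[j]."""
    k = len(rho[0])
    profile = {}
    for u in exponent_vectors(k, w):
        total = 0.0
        for row in rho:
            term = 1.0
            for p, e in zip(row, u):
                term *= p ** e
            total += term
        profile[u] = total
    return profile


def expansion_coefficients(rho, q, w):
    """Coefficients a_u with |u| <= w of prod_i (1 + sum_j (rho[i][j] - q[j]) z_j)."""
    k = len(q)
    coeffs = {(0,) * k: 1.0}
    for row in rho:
        nxt = defaultdict(float)
        for u, c in coeffs.items():
            nxt[u] += c                      # pick the constant 1 from this factor
            if sum(u) < w:                   # degrees above w are never needed
                for j in range(k):
                    v = u[:j] + (u[j] + 1,) + u[j + 1:]
                    nxt[v] += c * (row[j] - q[j])
        coeffs = dict(nxt)
    return coeffs
```

With $q$ set to the column averages of `rho`, the degree-one coefficients vanish, which is the identity $\sum_i (\rho(i,j) - q_j) = 0$ used in the proof of Lemma 22.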

Using those two lemmas, we can construct a cover for $(k^2 t, k)$-PMDs which has an exponentially
better dependence on $1/\epsilon$. We must cover at most $k^2 t$ CRVs, which we can assume each have
probabilities that are multiples of By the previous section, this induces a cost of 0(e) in total 
variation distance. To apply Lemma 22, we will first partition the CRVs into $(4ek^2)^k$ groups. In
particular, consider indexing the groups by $v \in [4ek^2]^k$. In group $v$, we include all CRVs with
mean vector $p$ where $p_j \in \frac{1}{4ek^2}[v_j - 1, v_j]$ for all $j \in [k]$. For the PMD induced by each group, we
have the property $|\max_i \rho(i,j) - \min_i \rho(i,j)| \le (4ek^2)^{-1}$. We cover each such PMD separately by
considering all possible different moment profiles that it can achieve. A moment profile for a PMD
of size $n$ is a vector of $|V_k(w)|$ elements, where the entry of the profile indexed by $u \in V_k(w)$ is
equal to $\sum_{i=1}^n \prod_{j=1}^k \rho(i,j)^{u_j}$. By Lemma 23, if two PMDs have the same moment profiles they have
the same approximator and thus by Lemma 22 and triangle inequality their total variation is at
most $2^{-w+1}$.

We now count how many different moment profiles are possible to arise. For a given u, there 
are at most k^^^ different values when |u| = 1, values for |u| = 2, and in general 

values when |tt| = i. Since there are at most vectors with |u| = i, there are at most 



different moment profiles. By picking w = A:log(^^), we get small enough error so that union 
bounding over all $(4ek^2)^k$ different groups will still give an $\epsilon$ error. This means that by considering
only a single PMD for each moment profile in each of the $(4ek^2)^k$ groups, we can create an $\epsilon$-cover

of size ^ ^ ^ ® concluding the proof of Lemma[7l 

D Details from Section 5

D.1 Estimating the mean and covariance of a PMD

We will prove an analogue of Lemma 6 in [DDS12], i.e., that we can accurately estimate the mean
and covariance of a PMD with a small number of samples. First, we will show that we can get
accurate estimates of the moments in any particular direction we desire. Then, taking the union
bound over directions, we show that our estimate is accurate for all directions simultaneously.

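Lemma 24. For any vector $y$, given sample access to an $(n, k)$-PMD $X$ with mean $\mu$ and covariance
matrix $\Sigma$, there exists an algorithm which can produce estimates $\hat\mu$ and $\hat\Sigma$ such that with probability
at least 9/10:

$$|y^T(\hat\mu - \mu)| \le \epsilon\sqrt{y^T\Sigma y} \quad\text{and}\quad |y^T(\hat\Sigma - \Sigma)y| \le \epsilon\, y^T\Sigma y\,\sqrt{1 + \frac{y^Ty}{y^T\Sigma y}}$$

The sample and time complexity are $O(1/\epsilon^2)$.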
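The estimators analyzed in the proof are the usual empirical mean and unbiased empirical covariance. The sketch below assumes NumPy; the helper names and the sampling routine are hypothetical and exist only so the estimators have something to run on. The sample size $\lceil 40/\epsilon^2 \rceil$ is the larger of the two sample sizes that appear in the proof.

```python
import math

import numpy as np


def samples_needed(eps):
    """Sample size sufficient for both estimates, per the Chebyshev argument."""
    return math.ceil(40 / eps ** 2)


def sample_pmd(probs, m, rng):
    """Draw m samples of a PMD; probs[i] is the distribution of the i-th CRV."""
    n, k = probs.shape
    samples = np.zeros((m, k))
    for i in range(n):
        picks = rng.choice(k, size=m, p=probs[i])  # basis vector chosen by CRV i
        samples[np.arange(m), picks] += 1
    return samples


def estimate_moments(samples):
    """Empirical mean and unbiased empirical covariance of the samples."""
    mu_hat = samples.mean(axis=0)
    sigma_hat = np.cov(samples, rowvar=False)      # divides by m - 1
    return mu_hat, sigma_hat


def directional_errors(y, mu_hat, sigma_hat, mu, sigma):
    """The two quantities bounded by the lemma, for one direction y."""
    return abs(y @ (mu_hat - mu)), abs(y @ (sigma_hat - sigma) @ y)
```

Since every sample of an $(n,k)$-PMD has coordinates summing to $n$, the estimated covariance always annihilates the all-ones vector, which gives a quick sanity check on an implementation.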

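Proof. We start with the estimate $\hat\mu$. Let $Z_1, \ldots, Z_m$ be independent samples from $X$, and let
$\hat\mu = \frac{1}{m}\sum_i Z_i$. Then

$$E[y^T\hat\mu] = y^T\mu \quad\text{and}\quad \mathrm{Var}[y^T\hat\mu] = \frac{1}{m}\mathrm{Var}[y^TX] = \frac{1}{m}y^T\Sigma y.$$

Then by Chebyshev's inequality,

$$\Pr\left[|y^T(\hat\mu - \mu)| \ge t\sqrt{y^T\Sigma y}/\sqrt{m}\right] \le \frac{1}{t^2}.$$

Choosing $t = \sqrt{10}$ and $m = \lceil 10/\epsilon^2 \rceil$, the above imply that $|y^T(\hat\mu - \mu)| \le \epsilon\sqrt{y^T\Sigma y}$ with probability
at least 9/10.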

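Next, we describe $\hat\Sigma$. Let $Z_1, \ldots, Z_m$ be independent samples from $X$, and let the empirical
estimator for the covariance be $\hat\Sigma = \frac{1}{m-1}\sum_i (Z_i - \hat\mu)(Z_i - \hat\mu)^T$, where $\hat\mu = \frac{1}{m}\sum_i Z_i$. Then it can be shown
that [Joh11]:

$$E[y^T\hat\Sigma y] = y^T\Sigma y \quad\text{and}\quad \mathrm{Var}[y^T\hat\Sigma y] = (y^T\Sigma y)^2\left(\frac{2}{m-1} + \frac{\kappa_y}{m}\right)$$

where $\kappa_y$ is the excess kurtosis of the distribution of $X$ with respect to the vector $y$ (i.e., $\kappa_y = \frac{E[(y^T(X-\mu))^4]}{(y^T\Sigma y)^2} - 3$).

It can be shown that:

$$\kappa_y \le \frac{\sum_i E[(y^T(X_i - \mu_i))^4]}{(y^T\Sigma y)^2},$$

where $X_i$ is the $i$th CRV in the PMD and $\mu_i$ is its mean. We note that $|y^T(X_i - \mu_i)| \le 2\|y\|_2$. This is because $\|X_i\|_2 = 1$,
$\|\mu_i\|_1 = 1$ and $\|\mu_i\|_2 \le \|\mu_i\|_1$. Therefore $(y^T(X_i - \mu_i))^4 \le 4\|y\|_2^2 (y^T(X_i - \mu_i))^2$ and thus

$$\kappa_y \le \frac{4\|y\|_2^2 \sum_i E[(y^T(X_i - \mu_i))^2]}{(y^T\Sigma y)^2} = \frac{4\|y\|_2^2\, y^T\Sigma y}{(y^T\Sigma y)^2} = \frac{4y^Ty}{y^T\Sigma y}.$$

Therefore, $\mathrm{Var}[y^T\hat\Sigma y] \le (y^T\Sigma y)^2\left(\frac{2}{m-1} + \frac{4y^Ty}{m\, y^T\Sigma y}\right) \le \frac{4(y^T\Sigma y)^2}{m}\left(1 + \frac{y^Ty}{y^T\Sigma y}\right)$. Again using Chebyshev's inequality,

$$\Pr\left[|y^T(\hat\Sigma - \Sigma)y| \ge t\,\frac{2y^T\Sigma y}{\sqrt{m}}\sqrt{1 + \frac{y^Ty}{y^T\Sigma y}}\right] \le \frac{1}{t^2}.$$

Taking $t = \sqrt{10}$ and $m = \lceil 40/\epsilon^2 \rceil$, the above imply that $|y^T(\hat\Sigma - \Sigma)y| \le \epsilon\, y^T\Sigma y\sqrt{1 + \frac{y^Ty}{y^T\Sigma y}}$ with
probability at least 9/10. □

Lemma 25. Let $\Sigma, \hat\Sigma \in \mathbb{R}^{k \times k}$ be two symmetric, positive semi-definite matrices, and let $(\lambda_1, v_1), \ldots, (\lambda_k, v_k)$
be the eigenvalue-eigenvector pairs of $\Sigma$. Suppose that

• For all $i \in [k]$, $\left|\left(\frac{v_i}{\sqrt{\lambda_i}}\right)^T (\hat\Sigma - \Sigma) \left(\frac{v_i}{\sqrt{\lambda_i}}\right)\right| \le \epsilon$,

• For all $i, j \in [k]$, $\left|\left(\frac{v_i}{\sqrt{\lambda_i}} + \frac{v_j}{\sqrt{\lambda_j}}\right)^T (\hat\Sigma - \Sigma) \left(\frac{v_i}{\sqrt{\lambda_i}} + \frac{v_j}{\sqrt{\lambda_j}}\right)\right| \le 4\epsilon$.

Then for all $y \in \mathbb{R}^k$,

$$|y^T(\hat\Sigma - \Sigma)y| \le 3k\epsilon\, y^T\Sigma y.$$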

Proof. Without loss of generality, we can focus on the case $\Sigma = I$, with eigenvalue-eigenvector pairs
$(1, e_j)$ for all $j \in [k]$. To see this, write $\Sigma$ as its eigendecomposition $Q\Lambda Q^T$, and replace $y$ with
$Q\Lambda^{-1/2}x$, which will place the matrix $\Sigma$ in "isotropic position."

We now have the guarantees

• For all $i \in [k]$, $|e_i^T(\hat\Sigma - I)e_i| \le \epsilon$,

• For all $i, j \in [k]$, $|(e_i + e_j)^T(\hat\Sigma - I)(e_i + e_j)| \le 4\epsilon$,

and we wish to show $|y^T(\hat\Sigma - I)y| \le 3k\epsilon\|y\|_2^2$ for all $y \in \mathbb{R}^k$.
We need the following proposition:

Proposition 11. For any vector $x \in \mathbb{R}^k$ and matrix $A \in \mathbb{R}^{k \times k}$,

$$x^TAx = \frac{1}{2}\sum_{i \ne j} x_i x_j (e_i + e_j)^T A (e_i + e_j) + 2\sum_i x_i^2 e_i^TAe_i - \left(\sum_i x_i\right)\left(\sum_i x_i e_i^TAe_i\right),$$

where $e_i$ is the $i$th standard basis vector.

Proof. Observe that

$$\frac{1}{2}\sum_{i \ne j} x_i x_j (e_i + e_j)^T A (e_i + e_j) = \frac{1}{2}\sum_{i \ne j} x_i x_j (A_{ii} + A_{jj} + A_{ij} + A_{ji}) = \sum_{i \ne j} x_i x_j A_{ij} + \sum_i x_i A_{ii} \sum_{j \ne i} x_j.$$

Adding the $2\sum_i x_i^2 e_i^TAe_i$ term gives us

$$\sum_{i,j} x_i x_j A_{ij} + \left(\sum_i x_i\right)\left(\sum_i x_i A_{ii}\right) = x^TAx + \left(\sum_i x_i\right)\left(\sum_i x_i e_i^TAe_i\right).$$

Subtracting the final term gives the desired result. □

We apply this to $|y^T(\hat\Sigma - I)y|$, giving

$$y^T(\hat\Sigma - I)y = \frac{1}{2}\sum_{i \ne j} y_i y_j (e_i + e_j)^T(\hat\Sigma - I)(e_i + e_j) + 2\sum_i y_i^2 e_i^T(\hat\Sigma - I)e_i - \left(\sum_i y_i\right)\left(\sum_i y_i e_i^T(\hat\Sigma - I)e_i\right).$$

Using the guarantees in the lemma statement,

$$|y^T(\hat\Sigma - I)y| \le \epsilon\left(2\sum_{i \ne j}|y_i||y_j| + 2\sum_i y_i^2 + \left(\sum_i |y_i|\right)^2\right) = \epsilon\left(2\sum_{i,j}|y_i||y_j| + \|y\|_1^2\right) = 3\epsilon\|y\|_1^2 \le 3k\epsilon\|y\|_2^2$$

where the final inequality is Cauchy-Schwarz. □
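The decomposition in Proposition 11 is a purely algebraic identity, so it can be sanity-checked numerically. The sketch below assumes NumPy; `quadratic_form_via_pairs` is a hypothetical name for a direct evaluation of the right-hand side.

```python
import numpy as np


def quadratic_form_via_pairs(x, A):
    """Evaluate x^T A x using only the forms e_i^T A e_i and (e_i+e_j)^T A (e_i+e_j)."""
    k = len(x)
    e = np.eye(k)
    diag = np.array([e[i] @ A @ e[i] for i in range(k)])
    pair_sum = 0.0
    for i in range(k):
        for j in range(k):
            if i != j:
                v = e[i] + e[j]
                pair_sum += x[i] * x[j] * (v @ A @ v)
    return 0.5 * pair_sum + 2 * np.sum(x ** 2 * diag) - np.sum(x) * np.sum(x * diag)
```

This is why controlling the error on the $O(k^2)$ directions $e_i$ and $e_i + e_j$ suffices to control it in every direction.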

Lemma 8. Given sample access to an $(n,k)$-PMD $X$ with mean $\mu$ and covariance matrix $\Sigma$ (with
minimum eigenvalue at least 1), there exists an algorithm which can produce estimates $\hat\mu$ and $\hat\Sigma$
such that with probability at least 9/10:

$$|y^T(\hat\mu - \mu)| \le \epsilon\sqrt{y^T\Sigma y} \quad\text{and}\quad |y^T(\hat\Sigma - \Sigma)y| \le \epsilon\, y^T\Sigma y$$

for all vectors $y$.

The sample and time complexity are $O(k^4/\epsilon^2)$.

Proof. The proof will follow by applying Lemma 24 to $k^2$ carefully chosen vectors simultaneously
using the union bound. Using the resulting guarantees, we show that the same estimates hold for
any direction, at a cost of rescaling $\epsilon$ by a factor of $k$.

Let $S$ be the set of $k^2$ vectors $\left\{\frac{v_i}{\sqrt{\lambda_i}}\right\}$ and $\left\{\frac{v_i}{\sqrt{\lambda_i}} + \frac{v_j}{\sqrt{\lambda_j}}\right\}$ for all $(i,j) \in [k] \times [k]$, where $\{(\lambda_i, v_i)\}$ are
the (unknown) eigenvalue-eigenvector pairs of $\Sigma$. From $O(k^2/\epsilon^2)$ samples, with probability 9/10,
we can obtain estimators $\hat\mu$ and $\hat\Sigma$ such that

|2/^(A - t)\ < and \y^{t - T.)y\ < 

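This follows by Lemma 24, the eigenvalue condition on $\Sigma$, and an application of the union bound.

We first prove that the mean estimator $\hat\mu$ is accurate. Consider an arbitrary vector $y$, which
can be decomposed into a linear combination of the eigenvectors, $y = \sum_i \alpha_i v_i$. Using the guarantee
for each direction $v_i/\sqrt{\lambda_i}$,

$$|y^T(\hat\mu - \mu)| = \left|\sum_i \alpha_i\sqrt{\lambda_i}\,\frac{v_i^T(\hat\mu - \mu)}{\sqrt{\lambda_i}}\right| \le \epsilon\sum_i |\alpha_i|\sqrt{\lambda_i} \le \epsilon\sqrt{k}\sqrt{\sum_i \alpha_i^2\lambda_i}$$

where the last inequality is Cauchy-Schwarz. Since $\sum_i \alpha_i^2\lambda_i = y^T\Sigma y$, this proves the desired
accuracy bound for the mean's estimator.

The accuracy of $\hat\Sigma$ follows from an application of Lemma 25. □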


D.2 Rounding preserves the mean and covariance 

In order to convert our estimate of the covariance matrix for the PMD to an estimate of the 
covariance matrix for the Gaussian component, we first need to understand how much the rounding 
step affected the covariance matrix. We will use the fact that the unrounded GMD we are sampling 
from and the rounded GMD we want to estimate are e-close in total variation and show the following 
lemma: 








Lemma 26. Suppose there exist two $\epsilon$-close $(n,k)$-GMDs with covariance matrices $\Sigma_1$ and $\Sigma_2$,
where the minimum eigenvalue of $\Sigma_1$ is at least $1/\epsilon^2$. Then for any vector $y$, $|y^T(\Sigma_1 - \Sigma_2)y| \le 9\epsilon\, y^T\Sigma_1 y$.

Proof. Since the variance of the GMD with covariance matrix $\Sigma_1$ is at least $1/\epsilon^2$ when projecting
to direction $y$, we can apply the Berry-Esseen theorem (Proposition 4) to show that it is close in
Kolmogorov distance to a Gaussian with the same mean and variance $y^T\Sigma_1 y$. To do this, we first
re-center the GMD by subtracting the mean from each summand and projecting in direction $y$
with $\|y\|_2 = 1$. This gives us a sum of $n$ independent random variables that lie in $[-\sqrt{2}, \sqrt{2}]$. This
implies that $\rho_i \le \sqrt{2}\sigma_i^2$ and Proposition 4 gives the Kolmogorov distance induced to be:

_ I _=_i _= ^3/2 < 

(l/^Siy)V2 

We will now show that the variance of the second GMD along direction y needs to also be at 
least 1/e^, in order for the two GMDs to have total variation distance less than e. We assume that 
this is not the case, for the sake of contradiction. 

Consider the random variable Y that is distributed according to the second GMD in direction 
y. By Chebyshev’s inequality, we have that: Pr[|y — E[y]| > < e. However, the first 

GMD has $\Omega(1)$ probability mass distributed outside the interval one standard deviation from its
mean, since it is well approximated in Kolmogorov distance (and thus, by Fact 1, in total variation
distance) by a Gaussian. Therefore, the two GMDs are $\Omega(1)$-far, which is a contradiction.

Now, since the second GMD has minimum variance at least $1/\epsilon^2$, we can also approximate it by a
Gaussian as before using the Berry-Esseen bound, losing $\epsilon$ in Kolmogorov distance. Proposition 12
then implies that in order for the total variation distance between the two to be at most $3\epsilon$, we
must have that $|y^T(\Sigma_1 - \Sigma_2)y| \le 9\epsilon\, y^T\Sigma_1 y$. □

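Proposition 12. For two single dimensional Gaussians $\mathcal{N}_1 = \mathcal{N}(\mu_1, \sigma_1^2)$, $\mathcal{N}_2 = \mathcal{N}(\mu_2, \sigma_2^2)$ such
that $\frac{\sigma_1}{\sigma_2} \notin (1 - \epsilon, 1 + \epsilon)$, it holds that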

dK{M{pi,al),M{p.2,cr2)) > |. 

Proof. Without loss of generality, suppose $\mu_1 \le \mu_2$ and $\sigma_1 \le \sigma_2$. Consider the point $x = \mu_1 + \sqrt{2}\sigma_1$.
At this point, the CDF of the first Gaussian is equal to $\frac{1}{2}(1 + \operatorname{erf}(1))$. Similarly, the CDF of the second

Gaussian is at most ^(1 -|-erf(|^)) < ^(1 -|-erf(l — e)). Therefore, d]^{Mi,M 2 ) > > | 

where the last inequality holds for all e G (0,1). □ 
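The witness point used in this proof is easy to evaluate directly. The sketch below uses only the Python standard library; the function names are hypothetical. The constant 0.2 in the usage check is an assumption of this illustration, derived from the elementary bound $\operatorname{erf}'(t) \ge \frac{2}{\sqrt{\pi}}e^{-1}$ on $[0,1]$, and is not claimed to be the constant in the proposition.

```python
import math


def normal_cdf(x, mu, sigma):
    """CDF of N(mu, sigma^2) at x, written via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (math.sqrt(2.0) * sigma)))


def cdf_gap_at_witness(mu1, sigma1, mu2, sigma2):
    """Difference of the two CDFs at x = mu1 + sqrt(2) * sigma1.

    This lower bounds the Kolmogorov distance, which is a supremum over x.
    """
    x = mu1 + math.sqrt(2.0) * sigma1
    return normal_cdf(x, mu1, sigma1) - normal_cdf(x, mu2, sigma2)
```

For equal means and $\sigma_1/\sigma_2 = 1 - \epsilon$ the gap equals $\frac{1}{2}(\operatorname{erf}(1) - \operatorname{erf}(1 - \epsilon))$, and shifting $\mu_2$ to the right only increases it.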

Applying Lemma 26 implies that our estimate for the PMD's covariance matrix is also a good
estimate of the covariance matrix after applying the rounding procedure described in Section B.1.
Moreover, the mean is preserved almost exactly since, by construction, there is a small additive 
error of c in each coordinate. Since the minimum eigenvalue of the PMD’s covariance matrix is at 
least 1, this additive error is negligible. 

D.3 Converting moment estimates from the PMD to the Gaussian 

In the previous sections, we showed how to estimate the moments of the rounded PMD. However,
we cannot use these estimates to obtain the moments of the Gaussian component of the structure
directly. The problem is that since the rounded $(n, k)$-Poisson multinomial random vector might be
the sum of a Gaussian and a $(tk^2, k)$-Poisson multinomial random vector, the empirical mean and
covariance of the samples might be very different than the mean and covariance of the Gaussian
component we want to estimate. In this section, we show how to convert our estimates to accurately
describe the Gaussian component by appropriately guessing the error induced by the non-Gaussian
component.

Let $(\mu, \Sigma)$, $(\mu_G, \Sigma_G)$, $(\mu_S, \Sigma_S)$ be the means and covariance matrices of the (rounded) $(n,k)$-PMD,
of the Gaussian component, and of the $(tk^2, k)$-PMD respectively and $(\hat\mu, \hat\Sigma)$ be the empirical
mean and covariance matrix we estimated in the previous section. It holds that $\mu = \mu_G + \mu_S$ and
$\Sigma = \Sigma_G + \Sigma_S$.

By Lemma 8 and Lemma 26, after taking $O(k^4/\epsilon^2)$ samples, with high probability, we have
that for all vectors $y$, $|y^T(\hat\mu - \mu)| \le \epsilon\sqrt{y^T\Sigma y}$ and $|y^T(\hat\Sigma - \Sigma)y| \le \epsilon\, y^T\Sigma y$. We show how to correct
our estimate $(\hat\mu, \hat\Sigma)$. In particular, we will generate a set of candidates which contains an estimate
$(\hat\mu_G, \hat\Sigma_G)$ such that for all vectors $y$, $|y^T(\hat\mu_G - \mu_G)| \le \epsilon\sqrt{y^T\Sigma_G y}$ and $|y^T(\hat\Sigma_G - \Sigma_G)y| \le \epsilon\, y^T\Sigma_G y$.
We do this without any additional samples, by carefully gridding around the estimated mean and
covariance.

To achieve the guarantee for the covariance matrix, we compute a sparse cover of the space of
all PSD matrices around $\hat\Sigma$.


Definition 11. Let $S$ be a set of symmetric $k \times k$ PSD matrices. An $\epsilon$-cover of the set $S$, denoted
by $S_\epsilon$, is a set of PSD matrices such that for any matrix $A \in S$, there exists a matrix $B \in S_\epsilon$ such
that for all vectors $y$: $|y^T(A - B)y| \le \epsilon\, y^TAy$.

Using the fact that $|y^T(\hat\Sigma - \Sigma)y| \le \epsilon\, y^T\Sigma y$ and $|y^T(\Sigma - \Sigma_G)y| = |y^T\Sigma_S y| \le m\, y^Ty$, we know
that $|y^T(\hat\Sigma - \Sigma_G)y| \le \epsilon\, y^T\Sigma y + m\, y^Ty \le 2\epsilon\, y^T\hat\Sigma y + m\, y^Ty$. This means that in order to get an
estimate $\hat\Sigma_G$ such that for all directions $y$, $|y^T(\hat\Sigma_G - \Sigma_G)y| \le \epsilon\, y^T\Sigma_G y$, it suffices to consider an
$\epsilon$-cover of the PSD matrices $A$ that satisfy the property $|y^T(\hat\Sigma - A)y| \le 2\epsilon\, y^T\hat\Sigma y + m\, y^Ty$ for all
vectors $y$. The following lemma gives an efficient construction of the cover and bounds its size.


Lemma 9. Let $A$ be a symmetric $k \times k$ PSD matrix with minimum eigenvalue 1 and let $S$ be the
set of all matrices $B$ such that $|y^T(A - B)y| \le \epsilon_1 y^TAy + \epsilon_2 y^Ty$ for all vectors $y$, where $\epsilon_1 \in [0, 1/4)$

and 62 £ [0, 00 ). Then, there exists an e-cover S^ of S that has size ^ ) 


Proof. To construct the cover, we will make use of the eigenvalues and eigenvectors of the matrix
$A$. We first show that for any matrix $B \in S$, its eigenvalues are close to the eigenvalues of $A$.

Proposition 13. Let $A$, $B$ be two symmetric $k \times k$ PSD matrices such that for all vectors $y$ with
$\|y\| = 1$, $|y^T(A - B)y| \le \epsilon_1 y^TAy + \epsilon_2$ for some constants $\epsilon_1, \epsilon_2 > 0$. Then for the eigenvalues
$\lambda_1^A \le \cdots \le \lambda_k^A$ of $A$, and the eigenvalues $\lambda_1^B \le \cdots \le \lambda_k^B$ of $B$, it holds that:

$$|\lambda_i^A - \lambda_i^B| \le \epsilon_1\lambda_i^A + \epsilon_2$$

Proof. From Courant's minimax principle, we have that the $i$-th eigenvalue of $A$ is equal to:

$$\lambda_i^A = \max_C \min_{\|x\|=1,\ Cx=0} x^TAx$$

where $C$ is an $(i-1) \times k$ matrix. For the matrix $B$, we have that

$$\lambda_i^B = \max_C \min_{\|x\|=1,\ Cx=0} x^TBx \le \max_C \min_{\|x\|=1,\ Cx=0} \left((1+\epsilon_1)x^TAx + \epsilon_2\right) = (1+\epsilon_1)\lambda_i^A + \epsilon_2$$

Similarly, we have that $\lambda_i^B \ge (1-\epsilon_1)\lambda_i^A - \epsilon_2$, so the result follows. □

This means that by computing the eigenvalues $\mu_1 \le \cdots \le \mu_k$ of $A$ and then guessing $8\epsilon_2$ possible
values to subtract in the range $[-\epsilon_2, \epsilon_2]$ with accuracy 1/4, we can get estimates of the eigenvalues
$\lambda_1, \ldots, \lambda_k$ of $B$ within a multiplicative factor of $1 \pm 1/2$. This is true because the minimum eigenvalue
of $B$ is at least 1. We can improve our estimates to a better multiplicative factor $1 \pm \epsilon$ by gridding
multiplicatively around each eigenvalue. This requires another $\log_{1+\epsilon}\left(\frac{1+1/2}{1-1/2}\right) = O(1/\epsilon)$ guesses
per eigenvalue. So in total, we require $O(\epsilon_2/\epsilon)^k$ guesses for obtaining accurate estimates $\lambda'_1, \ldots, \lambda'_k$
of the eigenvalues of $B$.
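The eigenvalue perturbation bound of Proposition 13 can be checked numerically on a small instance. The sketch below assumes NumPy; `eigenvalue_gaps` is a hypothetical helper, and the way the test matrix $B$ is built (a slight rescaling of $A$ minus a rank-one term of operator norm $\epsilon_2$) is just one convenient way to satisfy the proposition's hypothesis.

```python
import numpy as np


def eigenvalue_gaps(A, B):
    """Return the sorted eigenvalues of A and the gaps |lambda_i^A - lambda_i^B|."""
    lam_a = np.linalg.eigvalsh(A)  # ascending order for symmetric matrices
    lam_b = np.linalg.eigvalsh(B)
    return lam_a, np.abs(lam_a - lam_b)


def perturbed_matrix(A, eps1, eps2):
    """A matrix B with |y^T (A - B) y| <= eps1 * y^T A y + eps2 for unit y."""
    k = A.shape[0]
    u = np.ones(k) / np.sqrt(k)                  # unit vector
    return (1 + eps1 / 2) * A - eps2 * np.outer(u, u)
```

Comparing the gaps returned by `eigenvalue_gaps` with $\epsilon_1\lambda_i^A + \epsilon_2$ then illustrates the statement eigenvalue by eigenvalue.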

Once we know (approximately) the eigenvalues of $B$, we will try to guess also its eigenvectors
$v_1, \ldots, v_k$. We will do this by performing a careful gridding around the eigenvectors of $A$ which we
can assume, without loss of generality (by rotating), to be the standard basis vectors $e_1, e_2, \ldots, e_k$.
So for each eigenvector $v_z$ of $B$, we will try to approximate it by guessing its projections to the
eigenvectors of $A$.

We now bound the projections of eigenvectors of A to eigenvectors of B. Since we know that 
ejBci < (1 + ei)ef Aei + 62 , we get that < (1 + ei)/ii + 62 which implies that 

Vz,i < 2;^^+£2 ^ Moreover, since A^ > max{(l — 61 )//^ — 62 , 1 }, we know that the projection of Vz to 

Cj will be smaller than 1 } ' additional bound for the projection of Vz to Cj can be 

obtained by considering the variance of the matrices A and B in the direction Vz- Since we know 
that v’^Bvz > (1 - ei)v'^Avz - £ 2 , we get that '^iHiivzGiY < {\z + 62 ) < 2 (A 2 + 62 ) which 

implies that Vz^i < 


) ^ 2+^2 




We now guess vectors u(,...,u(, that approximate the eigenvectors of B by additively gridding 
over the projections to each eigenvector of A. To get an approximation of the eigenvector Vz, we 
grid over a projection to e, with accuracy e' min 1 } ’ ^ small enough e' that 

only depends on k, 61 and 62 . This requires p guesses for each projection, and thus guesses 

for all projections. The final covariance matrix we output is then B = A(,u(,(u(,)^. 

We will now show that the covariance matrix $\hat B$ satisfies the property that it is close in all
directions to $B$. To do this we will make use of Lemma 25 and only consider directions $y = \frac{v_z}{\sqrt{\lambda_z}}$
for $z \in [k]$ and $y = \frac{v_z}{\sqrt{\lambda_z}} + \frac{v_{z'}}{\sqrt{\lambda_{z'}}}$ for $z, z' \in [k]$.

We now consider direction y = We have that: 

V A2 

= Erl-*-*)" = 

V z y z I z ■ Z Z 2 

The first term is in the range [(1 — 6)(1 — ke')'^, (1 + 6)(1 + A: 6 ')^], which for s' < s/k, becomes 
(1 ± 0(6)). The rest of the terms can be bounded as follows: 
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-fivziv'i - Vi)f < (1 + 


< 




^l±^e'2 

fij 


ij'j + £2 


max{//j - 2^2, 1 } 


<(l + e) 




1+^2 


max{//i - 2e2,1} 
2 


< (1 + ^) ( X] “*" 


2\ 


max{/ij - 262, 1 } 


< (1 + e) ^ 4 A :(1 + 62)6 ^ 

< (1 + e) (8A:(1 + e 2 )\/ 62^0 — Z 


P-i + £2 


max{/xi - 262,1} 
6 
k 


for e' = 0(-ye((l + 62 )^:) This means that vJBvz G (1 — 6 ,1 + e)Xz. The proof is similar for 

directions y = for z^z' G [A:!. 

VA 2 V A^/ 


Overall, we can get an estimate B of any matrix S G 5 by making at most 
guesses, which implies an 6 -cover of this size. 


□ 


-) 


Applying Lemma [9] for 61 = 26 and 62 = m < tk"^, it is easy to see that we can get a good 
estimate Sg of Eg using only ^ guesses. This completes the analysis for obtaining an 

accurate estimate for the covariance matrix. The same approach also gives us an accurate estimate 
for the mean vector. We guess the projection of the mean on each of the (approximate) eigenvectors 
with accuracy proportional to the square root of the corresponding eigenvalue as in Lemma [9j This 
requires only additional guesses, so overall we can compute the estimates [Ilgi^g) using 

1 lk\0{k‘^) 

only (-) guesses. 


D.4 Probability Density Computation 

In order to apply Theorem 7, we need access to a PDF comparator (Definition 9). We will implement
this by explicitly computing the probability mass function (PMF) of a distribution at a given point
$x$. The naive computation could require time which is polynomial in $n$ or exponential in $1/\epsilon$. We
will show how to avoid these costs using a dynamic program.

Lemma 27. There exists an algorithm which computes the probability mass function for the convolution
of a discretized Gaussian with a $(\mathrm{poly}(k/\epsilon), k)$-PMD at a given point $x$ in time $(k/\epsilon)^{O(k)}$.

Proof. Let $G(\cdot)$ and $G_d(\cdot)$ be the PDF and PMF of the non-discretized and the discretized Gaussian,
respectively. Similarly, let $PMD(\cdot)$ be the PMF of the $(\mathrm{poly}(k/\epsilon), k)$-PMD. For any given integer
point $x$, we can compute $G_d(x)$ by computing the integral of the non-discretized Gaussian in a unit
box around $x$, i.e. by letting $R(x) = \prod_i [x_i - 1/2, x_i + 1/2]$, we have that $G_d(x) = \int_{R(x)} G(t)\,dt$. We
can compute this integral with very high accuracy using numerical integration methods.

To compute $PMD(x)$, we use dynamic programming. We will maintain the variables $P(x, i)$
that give us the probability at the point $x$ in the support of the PMD considering only the first $i$
CRVs. It is easy to compute $P(x, i)$ as $\sum_j p_{i,j} P(x - e_j, i - 1)$ where $p_{i,j}$ is the probability the $i$-th
CRV assigns to coordinate $j$. Since there are $(k/\epsilon)^{O(k)}$ points in the support of the PMD and at
most $\mathrm{poly}(k/\epsilon)$ CRVs in the PMD, we can compute the probability density for the whole support
of the PMD in time $(k/\epsilon)^{O(k)}$.

To compute the probability density at point $x$ for the convolution of $PMD(\cdot)$ with $G_d(\cdot)$, we
write it as: $\sum_y PMD(y)\, G_d(x - y)$. We only need to consider the summation for points $y$ in the
support of the $(\mathrm{poly}(k/\epsilon), k)$-PMD. Since there are at most $(k/\epsilon)^{O(k)}$ such points the lemma follows. □
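The dynamic program for the PMD's probability mass function can be sketched as follows. This is a minimal illustration using only the standard library; `pmd_pmf` is a hypothetical name, the table is stored as a dictionary keyed by count vectors, and the recurrence is applied in its forward form (pushing mass from $P(\cdot, i-1)$ to $P(\cdot, i)$), which is equivalent to the one stated in the proof.

```python
from collections import defaultdict


def pmd_pmf(probs):
    """PMF of a sum of independent categorical random vectors (CRVs).

    probs[i][j] is the probability that the i-th CRV equals the basis vector e_j.
    Returns a dict mapping each count vector x to Pr[X = x].
    """
    k = len(probs[0])
    table = {(0,) * k: 1.0}                   # P(., 0): all mass at the origin
    for row in probs:                         # add one CRV at a time
        nxt = defaultdict(float)
        for x, mass in table.items():
            for j, p in enumerate(row):
                if p == 0.0:
                    continue
                y = x[:j] + (x[j] + 1,) + x[j + 1:]   # y = x + e_j
                nxt[y] += p * mass
        table = dict(nxt)
    return table
```

After processing $N$ CRVs the table holds at most $\binom{N + k - 1}{k - 1}$ entries, so the running time stays polynomial in $N$ for fixed $k$, in line with the $(k/\epsilon)^{O(k)}$ bound when $N = \mathrm{poly}(k/\epsilon)$.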


E Details from Section [6] 


We first recall the main structural result from [DDO+13]:

Lemma 10 (Corollary 4.8 of [DDO+13]). Let $S = X_1 + \cdots + X_n$ be an $(n,k)$-SIIRV for some
positive integer $k$. Let $\mu$ and $\sigma^2$ be respectively the mean and variance of $S$. Then for all $\epsilon > 0$,
the distribution of $S$ is $O(\epsilon)$-close in total variation distance to one of the following:

1. a random variable supported on ^ consecutive integers with variance < 15(A:^®/e®) log^(l/e); 
or 


2 . 


the sum of two independent random variables Si + cS 2 , where c is some positive integer 
1 < c < k — 1, S 2 is distributed according to [AA(/i, cj^)], and Si is a c-IRV; in this case, 


cj^ = n 




log2(l/e)j. 


Now, we provide learning algorithms for the two cases, corresponding to Lemmas 5.1 and 5.2 
in [DDO+13] . 

Lemma 28. There is a procedure Learn-Sparse with the following properties: It takes as input an 
accuracy parameter s' > 0, and a confidence parameter 6' > 0, as well as access to samples from 
an {n,k)-SIIRV S. Learn-Sparse uses m = • 0(log*^^^(l/e) log samples from S and 

has the following guarantee: If the variance of S is at most 15(A:^®/e'®) log^(l/e'), then we return 
a hypothesis variable such that d^viS, Hf) = 0{s') with probability at least 1 — 6'. 

Proof Let 5 = '£7=1 Si, and be a (n, A:)-PMD such that (1,..., = 

Si for all i. We will apply the rounding procedure described in Section [B. II on to argue that 

S is close to a shifted (poly(A:/e'), A:)-SIIRV. This will be sufficient to complete the proof, as we 
can e'-cover S by considering a cover of all unshifted (poly(A:/e'), A:)-PMDs when projected onto 
(1,...,A:), which, by Theorem [2l is a set of size N = xhe shift 

is determined by taking 0(1/e''^) samples and trying all integers within an additive poly{k/s') 
of the mean of these samples. We select one of these hypotheses using Theorem [71 requiring 
^ log V = k^^ ■ 0(log^’''^(l/e) log l/(5'/e'^) samples, as desired. 

Let T^^^ be result when the rounding procedure of Section E His applied to , and let 

T = '£7=1 Pi where T, = (1,..., k£ ■ Tf^^. By Lemma [T] and the Data Processing Inequality 
(Lemma [13]), this tells us that 


^TV {S, T) < C^TV ('S' 


PMD rpPMD 


)<0 


a/2^5/2i^gl/2 =0{e'), 
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where the last equality follows by our choice of c. We prove by contradiction that T’s variance is 
still poly(/c/e'). Suppose not, that where •= poly(A;/e'). We apply the Berry-Esseen 

theorem (Proposition [3|) to T', which is a re-centered version of T. Defining T[ ^ Ti — E[Ti], and 
observing that T' G [—k,k] we note that ht' = 0, = cj|n, pnp: < ka^. Thus, 

dK{T,M{pT,CI'T)) ^ 

By triangle inequality and the fact that total variation distance upper bounds Kolmogorov distance, 

dTv{S,J\f{pT,(^T)) ^ 0{e') + 

However, anticoncentration of a Gaussian tells us that for any point x, 

Pr(lAf(fiT, (Tt) -x\<£)< 

Examine the interval of width k^j2e'‘^ centered at E[S]. S assigns at least 1 — e' mass to this 
interval, but J\f{pT,o'T) assigns at most mass. If |(1 —e') — ^^/Cl > 0{e') + ^, which 

happens for = uj{k^/e'^), this interval demonstrates that the total variation distance is larger 
than we showed above, thus arriving at a contradiction. Thus, we have that the variance of T is at 
most 

By the rounding procedure, we know that the variance of any Tj which is non-constant is at 
least c(l — c) > c/2. Since variance is additive and the variance of T is at most this implies 
that there are at most 2C^jc = non-constant Ti. Therefore, S is e'-dose to a shifted 

(0(A;24/e'i^), yt)-SIIRV, as desired. □ 

Lemma 29. There is a procedure Learn-Heavy with the following properties: It takes as input a 
value ^ G {1,..., A: — 1}, an accuracy parameter e' > 0, a variance parameter and 

a confidence parameter 6' > 0, as well as access to samples from a poly{n)-IRV S. Learn-Heavy 
uses m = -I- log(l/(5'))) samples from S, runs in time 0{m), and has the following 

performance guarantee: 

Suppose that dT:y{S,lZ -I-T) < e’ , where Z is a discretized random variable distributed as 
[J\f , for some a'‘^ > a‘^ , Y is a i-IRV, and Z and Y are independent. Then Learn-Heavy 

outputs a hypothesis variable such that dT\/{S, H^) < 0{e') with probability at least 1 — 5'. 

Proof. This follows similarly to the proof of Lemma 5.2 in [DDO+13]. $Y$ and $Z$ are learned in
separate stages. $Y$ is learned identically as in their algorithm, by using the empirical distribution
of $O((1/\epsilon'^2)(\ell + \log(1/\delta')))$ samples reduced to their residue mod $\ell$.

Learning $Z$ is performed differently. We take $O(\log(1/\delta')/\epsilon'^2)$ samples and replace each value $v$
with the value $\lfloor v/\ell \rfloor$. In other words, given samples from $S$, we simulate samples from $Z' = \lfloor S/\ell \rfloor$.
Since $d_{TV}(S, \ell Z + Y) \le \epsilon'$, Lemma 13 implies that $d_{TV}(Z', Z) \le \epsilon'$, which in turn implies that
$d_K(Z', Z) \le \epsilon'$, using Fact 1.
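The two reductions applied to the samples (residue mod $\ell$ for learning $Y$, floor division by $\ell$ for learning $Z$) amount to a single integer division. The sketch below uses a hypothetical helper name and assumes integer samples; Python's `//` and `%` implement floor division and a non-negative residue for positive $\ell$, so negative values are handled as in the text.

```python
def split_samples(samples, ell):
    """Split each integer sample v into (floor(v / ell), v mod ell)."""
    quotients = [v // ell for v in samples]   # simulated samples from Z'
    residues = [v % ell for v in samples]     # used for the empirical distribution of Y
    return quotients, residues
```

When $S = \ell Z + Y$ exactly with $Y \in \{0, \ldots, \ell - 1\}$, the quotient recovers $Z$ and the residue recovers $Y$.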

Using Lemma m our samples from S give us a distribution Z' such that dK{Z',Z') < e' with 
probability 1 — 5'. 

We make the following straightforward observation, bounding the Kolmogorov distance between 
a Gaussian and the corresponding discretized Gaussian. 

Proposition 14. d\<^{N{pi,a‘^), 
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Using triangle inequality and the lower bound on cr^, this tells us that dK(■^^ ^)) < 0{£'). 

Now, we can apply the following robust statistics results from [DK14] : 

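Lemma 30 (Lemmas 9 and 10 of [DK14]). Let $F$ be a distribution such that $d_K(\mathcal{N}(\mu, \sigma^2), F) \le \epsilon$.
Then

• $\mathrm{med}(F) \triangleq F^{-1}\left(\frac{1}{2}\right) \in \mu \pm O(\epsilon\sigma)$

• $\frac{\mathrm{IQR}(F)}{2\sqrt{2}\operatorname{erf}^{-1}(1/2)} \triangleq \frac{F^{-1}(3/4) - F^{-1}(1/4)}{2\sqrt{2}\operatorname{erf}^{-1}(1/2)} \in \sigma \pm O(\epsilon\sigma)$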


By taking the median and a rescaling of the interquartile range of Z, we get estimates fi' and a'"^ 
which are within ±0(e^) of the true parameters. Proposition[2]implies dTY{Af{fi', ^)) < 

O(e'). Applying Lemma [13] gives us dTv(L■^(/^^ ^ O(e'). The result follows using triangle 

inequality on the estimates for Y and Z. □ 
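The median and rescaled interquartile range estimators can be sketched with the Python standard library. The names below are hypothetical; the rescaling constant $2\sqrt{2}\operatorname{erf}^{-1}(1/2)$ is computed as $2\Phi^{-1}(3/4)$, which is the same number, and the empirical quartiles use the default method of `statistics.quantiles`, an implementation choice of this sketch.

```python
import statistics

# IQR of a standard normal: 2 * Phi^{-1}(3/4) = 2 * sqrt(2) * erfinv(1/2).
IQR_OF_STANDARD_NORMAL = 2.0 * statistics.NormalDist().inv_cdf(0.75)


def robust_gaussian_fit(samples):
    """Estimate (mu, sigma) of a near-Gaussian from its median and IQR."""
    q1, _, q3 = statistics.quantiles(samples, n=4)
    mu_hat = statistics.median(samples)
    sigma_hat = (q3 - q1) / IQR_OF_STANDARD_NORMAL
    return mu_hat, sigma_hat
```

Both statistics depend only on a few quantiles, which is why closeness in Kolmogorov distance is enough to control them.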


Now, we run Learn-Sparse once, and Learn-Heavy for $c = 1$ to $k - 1$. This will give us a
set of $k$ hypotheses, at least one of which is close to the true distribution. We use the subroutine
FastTournament (as described by Theorem 7) to select one of these hypotheses. Theorem 4 follows
by combining Lemma 10 with the guarantees provided by Lemmas 28 and 29.
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